In Class Exercise 7

Author

Arjun Singh

Published

October 14, 2024

Modified

October 14, 2024

7 Introduction

In this exercise, we will work on reinforcing our learning from Hands-on Exercise 7 with further exercises.

7.1 Importing the packages

The following packages are imported into our environment to facilitate analysis.

  1. olsrr: Provides tools for building and evaluating Ordinary Least Squares (OLS) regression models, including diagnostic and selection methods.

  2. corrplot: A package for visualizing correlation matrices using different methods, such as color-coded heatmaps and circles.

  3. ggpubr: Facilitates easy creation of publication-ready plots based on ggplot2, with additional features for customization and statistical annotations.

  4. sf: Stands for Simple Features, providing support for handling, analyzing, and visualizing spatial data within R.

  5. spdep: Specializes in spatial dependence modeling and analysis, including spatial autocorrelation, spatial regression, and spatial weights generation.

  6. GWmodel: A package that implements Geographically Weighted Regression (GWR) and other geographically weighted models for spatial data analysis.

  7. tmap: Provides an intuitive syntax for creating thematic maps and handling spatial data, supporting both static and interactive maps.

  8. tidyverse: A collection of R packages designed for data science that share a common philosophy, including data manipulation (dplyr), visualization (ggplot2), and more.

  9. gtsummary: Simplifies the process of creating summary tables for statistical analyses, particularly useful for regression models and descriptive statistics.

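The p_load() function of the pacman package installs any missing packages and loads them all, as shown in the code chunk below. Note that the call also loads glue, ggstatsplot (which provides ggcorrmat(), used later) and sfdep, and that gtsummary, although described above, is not part of the call.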

pacman::p_load(olsrr, corrplot, ggpubr, sf, spdep, GWmodel, tmap, tidyverse, glue, ggstatsplot, sfdep)
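For readers who do not have pacman available, the following is a minimal base R sketch of roughly what p_load() automates: install a package if it is missing, then attach it. The helper name load_or_install() is hypothetical, and the two packages in the usage line are base packages chosen only so that the sketch runs without downloading anything.

```r
# Hypothetical helper: a rough base R equivalent of pacman::p_load().
load_or_install <- function(pkgs) {
  vapply(pkgs, function(pkg) {
    # Install only when the package cannot be found in the library paths.
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    # require() attaches the package and returns TRUE on success.
    require(pkg, character.only = TRUE, quietly = TRUE)
  }, logical(1))
}

# Usage with two base packages, so nothing needs to be installed.
loaded <- load_or_install(c("stats", "utils"))
loaded
```

The result is a named logical vector, which makes it easy to spot a package that failed to load.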

7.2 Importing the data

7.2.1 Importing the geospatial data

We start off by importing the geospatial data into our environment. We use the st_read() function of the sf package for this.

mpsz = st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_WEB_PL")
Reading layer `MP14_SUBZONE_WEB_PL' from data source 
  `C:\arjxn11\ISSS626-GAA\In-class_Ex\In-class_Ex7\data\geospatial' 
  using driver `ESRI Shapefile'
Simple feature collection with 323 features and 15 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21

This dataset is in ESRI shapefile format and contains the planning subzone boundaries of the URA Master Plan 2014. The boundaries are represented as polygon features (geometry type MULTIPOLYGON), and the data is in the SVY21 projected coordinate system.

We will now check the CRS information and update it if required.

The EPSG code for SVY21 / Singapore TM is 3414.

We call the st_crs() function of the sf package as shown in the code chunk below.

st_crs(mpsz)
Coordinate Reference System:
  User input: SVY21 
  wkt:
PROJCRS["SVY21",
    BASEGEOGCRS["SVY21[WGS84]",
        DATUM["World Geodetic System 1984",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ID["EPSG",6326]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["Degree",0.0174532925199433]]],
    CONVERSION["unnamed",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["Degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["(E)",east,
            ORDER[1],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]],
        AXIS["(N)",north,
            ORDER[2],
            LENGTHUNIT["metre",1,
                ID["EPSG",9001]]]]

The WKT above does not carry an EPSG identifier for the CRS itself; the ID 9001 at the end refers only to the metre length unit. We therefore bring the layer to EPSG:3414 with the st_transform() function of the sf package and check the result with st_crs().

mpsz_svy21=st_transform(mpsz, 3414) 
st_crs(mpsz_svy21)
Coordinate Reference System:
  User input: EPSG:3414 
  wkt:
PROJCRS["SVY21 / Singapore TM",
    BASEGEOGCRS["SVY21",
        DATUM["SVY21",
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4757]],
    CONVERSION["Singapore Transverse Mercator",
        METHOD["Transverse Mercator",
            ID["EPSG",9807]],
        PARAMETER["Latitude of natural origin",1.36666666666667,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",103.833333333333,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["Scale factor at natural origin",1,
            SCALEUNIT["unity",1],
            ID["EPSG",8805]],
        PARAMETER["False easting",28001.642,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",38744.572,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["northing (N)",north,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["easting (E)",east,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Cadastre, engineering survey, topographic mapping."],
        AREA["Singapore - onshore and offshore."],
        BBOX[1.13,103.59,1.47,104.07]],
    ID["EPSG",3414]]

7.2.2 Importing and wrangling the Aspatial Data

Since the aspatial data is in CSV format, we use read_csv() of the readr package to import it.

Be careful to use read_csv() rather than base R's read.csv(): read_csv() returns a tibble and leaves the column names unchanged.

condo_resale = read_csv("data/aspatial/Condo_resale_2015.csv")

After importing the data file into R, it is important to check that it has been imported correctly.

The code chunk below uses glimpse() to display the data structure.

glimpse(condo_resale)
Rows: 1,436
Columns: 23
$ LATITUDE             <dbl> 1.287145, 1.328698, 1.313727, 1.308563, 1.321437,…
$ LONGITUDE            <dbl> 103.7802, 103.8123, 103.7971, 103.8247, 103.9505,…
$ POSTCODE             <dbl> 118635, 288420, 267833, 258380, 467169, 466472, 3…
$ SELLING_PRICE        <dbl> 3000000, 3880000, 3325000, 4250000, 1400000, 1320…
$ AREA_SQM             <dbl> 309, 290, 248, 127, 145, 139, 218, 141, 165, 168,…
$ AGE                  <dbl> 30, 32, 33, 7, 28, 22, 24, 24, 27, 31, 17, 22, 6,…
$ PROX_CBD             <dbl> 7.941259, 6.609797, 6.898000, 4.038861, 11.783402…
$ PROX_CHILDCARE       <dbl> 0.16597932, 0.28027246, 0.42922669, 0.39473543, 0…
$ PROX_ELDERLYCARE     <dbl> 2.5198118, 1.9333338, 0.5021395, 1.9910316, 1.121…
$ PROX_URA_GROWTH_AREA <dbl> 6.618741, 7.505109, 6.463887, 4.906512, 6.410632,…
$ PROX_HAWKER_MARKET   <dbl> 1.76542207, 0.54507614, 0.37789301, 1.68259969, 0…
$ PROX_KINDERGARTEN    <dbl> 0.05835552, 0.61592412, 0.14120309, 0.38200076, 0…
$ PROX_MRT             <dbl> 0.5607188, 0.6584461, 0.3053433, 0.6910183, 0.528…
$ PROX_PARK            <dbl> 1.1710446, 0.1992269, 0.2779886, 0.9832843, 0.116…
$ PROX_PRIMARY_SCH     <dbl> 1.6340256, 0.9747834, 1.4715016, 1.4546324, 0.709…
$ PROX_TOP_PRIMARY_SCH <dbl> 3.3273195, 0.9747834, 1.4715016, 2.3006394, 0.709…
$ PROX_SHOPPING_MALL   <dbl> 2.2102717, 2.9374279, 1.2256850, 0.3525671, 1.307…
$ PROX_SUPERMARKET     <dbl> 0.9103958, 0.5900617, 0.4135583, 0.4162219, 0.581…
$ PROX_BUS_STOP        <dbl> 0.10336166, 0.28673408, 0.28504777, 0.29872340, 0…
$ NO_Of_UNITS          <dbl> 18, 20, 27, 30, 30, 31, 32, 32, 32, 32, 34, 34, 3…
$ FAMILY_FRIENDLY      <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0…
$ FREEHOLD             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1…
$ LEASEHOLD_99YR       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
head(condo_resale$LONGITUDE) # first values of the LONGITUDE (x-coordinate) column
[1] 103.7802 103.8123 103.7971 103.8247 103.9505 103.9386
head(condo_resale$LATITUDE) # first values of the LATITUDE (y-coordinate) column
[1] 1.287145 1.328698 1.313727 1.308563 1.321437 1.314198
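The visual check above can be complemented by a few programmatic checks. The sketch below is only illustrative: check_import() is a hypothetical helper, the six records are copied from the glimpse() and head() output, and the latitude/longitude limits are taken from the bounding box reported for EPSG:3414 earlier.

```r
# First six records, copied from the glimpse() and head() output above.
condo_head <- data.frame(
  LATITUDE      = c(1.287145, 1.328698, 1.313727, 1.308563, 1.321437, 1.314198),
  LONGITUDE     = c(103.7802, 103.8123, 103.7971, 103.8247, 103.9505, 103.9386),
  SELLING_PRICE = c(3000000, 3880000, 3325000, 4250000, 1400000, 1320000),
  AREA_SQM      = c(309, 290, 248, 127, 145, 139)
)

# Hypothetical helper: returns a named logical vector of basic import checks.
check_import <- function(df) {
  c(
    no_missing     = !anyNA(df),
    # Limits follow the EPSG:3414 area of use (BBOX[1.13,103.59,1.47,104.07]).
    lat_plausible  = all(df$LATITUDE > 1.13 & df$LATITUDE < 1.47),
    lon_plausible  = all(df$LONGITUDE > 103.59 & df$LONGITUDE < 104.07),
    price_positive = all(df$SELLING_PRICE > 0)
  )
}

check_import(condo_head)
```

The same helper could be applied to the full condo_resale tibble; any FALSE entry would point to a column worth inspecting before the conversion to sf.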

We now apply the summary() function of base R to condo_resale.

summary(condo_resale)
    LATITUDE       LONGITUDE        POSTCODE      SELLING_PRICE     
 Min.   :1.240   Min.   :103.7   Min.   : 18965   Min.   :  540000  
 1st Qu.:1.309   1st Qu.:103.8   1st Qu.:259849   1st Qu.: 1100000  
 Median :1.328   Median :103.8   Median :469298   Median : 1383222  
 Mean   :1.334   Mean   :103.8   Mean   :440439   Mean   : 1751211  
 3rd Qu.:1.357   3rd Qu.:103.9   3rd Qu.:589486   3rd Qu.: 1950000  
 Max.   :1.454   Max.   :104.0   Max.   :828833   Max.   :18000000  
    AREA_SQM          AGE           PROX_CBD       PROX_CHILDCARE    
 Min.   : 34.0   Min.   : 0.00   Min.   : 0.3869   Min.   :0.004927  
 1st Qu.:103.0   1st Qu.: 5.00   1st Qu.: 5.5574   1st Qu.:0.174481  
 Median :121.0   Median :11.00   Median : 9.3567   Median :0.258135  
 Mean   :136.5   Mean   :12.14   Mean   : 9.3254   Mean   :0.326313  
 3rd Qu.:156.0   3rd Qu.:18.00   3rd Qu.:12.6661   3rd Qu.:0.368293  
 Max.   :619.0   Max.   :37.00   Max.   :19.1804   Max.   :3.465726  
 PROX_ELDERLYCARE  PROX_URA_GROWTH_AREA PROX_HAWKER_MARKET PROX_KINDERGARTEN 
 Min.   :0.05451   Min.   :0.2145       Min.   :0.05182    Min.   :0.004927  
 1st Qu.:0.61254   1st Qu.:3.1643       1st Qu.:0.55245    1st Qu.:0.276345  
 Median :0.94179   Median :4.6186       Median :0.90842    Median :0.413385  
 Mean   :1.05351   Mean   :4.5981       Mean   :1.27987    Mean   :0.458903  
 3rd Qu.:1.35122   3rd Qu.:5.7550       3rd Qu.:1.68578    3rd Qu.:0.578474  
 Max.   :3.94916   Max.   :9.1554       Max.   :5.37435    Max.   :2.229045  
    PROX_MRT         PROX_PARK       PROX_PRIMARY_SCH  PROX_TOP_PRIMARY_SCH
 Min.   :0.05278   Min.   :0.02906   Min.   :0.07711   Min.   :0.07711     
 1st Qu.:0.34646   1st Qu.:0.26211   1st Qu.:0.44024   1st Qu.:1.34451     
 Median :0.57430   Median :0.39926   Median :0.63505   Median :1.88213     
 Mean   :0.67316   Mean   :0.49802   Mean   :0.75471   Mean   :2.27347     
 3rd Qu.:0.84844   3rd Qu.:0.65592   3rd Qu.:0.95104   3rd Qu.:2.90954     
 Max.   :3.48037   Max.   :2.16105   Max.   :3.92899   Max.   :6.74819     
 PROX_SHOPPING_MALL PROX_SUPERMARKET PROX_BUS_STOP       NO_Of_UNITS    
 Min.   :0.0000     Min.   :0.0000   Min.   :0.001595   Min.   :  18.0  
 1st Qu.:0.5258     1st Qu.:0.3695   1st Qu.:0.098356   1st Qu.: 188.8  
 Median :0.9357     Median :0.5687   Median :0.151710   Median : 360.0  
 Mean   :1.0455     Mean   :0.6141   Mean   :0.193974   Mean   : 409.2  
 3rd Qu.:1.3994     3rd Qu.:0.7862   3rd Qu.:0.220466   3rd Qu.: 590.0  
 Max.   :3.4774     Max.   :2.2441   Max.   :2.476639   Max.   :1703.0  
 FAMILY_FRIENDLY     FREEHOLD      LEASEHOLD_99YR  
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :0.0000   Median :0.0000   Median :0.0000  
 Mean   :0.4868   Mean   :0.4227   Mean   :0.4882  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  

7.2.2.1 Converting aspatial data frame into a sf object

Currently, the condo_resale data frame is aspatial. The code chunk below converts it into a simple feature (sf) data frame by using the st_as_sf() function of the sf package.

condo_resale.sf <- st_as_sf(condo_resale,
                            coords = c("LONGITUDE", "LATITUDE"),
                            crs=4326) %>%
  st_transform(crs=3414)

Notice that st_transform() of the sf package is used to convert the coordinates from WGS84 (EPSG:4326) to SVY21 (EPSG:3414). The CRS is first declared as 4326 in st_as_sf() because the CSV file stores plain longitude and latitude values and carries no projection information.
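To see the two steps on a single point, here is a small sketch that assumes the sf package is installed. The point used is the natural origin of the SVY21 projection, with its longitude and latitude taken from the st_crs() output shown earlier, so after the transformation its coordinates should land very close to the false easting and false northing listed there (28001.642, 38744.572). The object names are illustrative only.

```r
library(sf)

# One illustrative point: the natural origin of the SVY21 projection
# (longitude/latitude copied from the st_crs() output above).
origin <- data.frame(LONGITUDE = 103.833333333333,
                     LATITUDE  = 1.36666666666667)

# Step 1: declare that the raw numbers are WGS84 longitude/latitude.
origin_sf <- st_as_sf(origin, coords = c("LONGITUDE", "LATITUDE"), crs = 4326)

# Step 2: reproject to SVY21 / Singapore TM (metres).
origin_svy21 <- st_transform(origin_sf, crs = 3414)

coords <- st_coordinates(origin_svy21)
coords
```

Declaring the CRS only labels the coordinates, whereas st_transform() actually recomputes them, which is why both steps are needed.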

Next, head() is used to list the content of the condo_resale.sf object.

head(condo_resale.sf)
Simple feature collection with 6 features and 21 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 22085.12 ymin: 29951.54 xmax: 41042.56 ymax: 34546.2
Projected CRS: SVY21 / Singapore TM
# A tibble: 6 × 22
  POSTCODE SELLING_PRICE AREA_SQM   AGE PROX_CBD PROX_CHILDCARE PROX_ELDERLYCARE
     <dbl>         <dbl>    <dbl> <dbl>    <dbl>          <dbl>            <dbl>
1   118635       3000000      309    30     7.94          0.166            2.52 
2   288420       3880000      290    32     6.61          0.280            1.93 
3   267833       3325000      248    33     6.90          0.429            0.502
4   258380       4250000      127     7     4.04          0.395            1.99 
5   467169       1400000      145    28    11.8           0.119            1.12 
6   466472       1320000      139    22    10.3           0.125            0.789
# ℹ 15 more variables: PROX_URA_GROWTH_AREA <dbl>, PROX_HAWKER_MARKET <dbl>,
#   PROX_KINDERGARTEN <dbl>, PROX_MRT <dbl>, PROX_PARK <dbl>,
#   PROX_PRIMARY_SCH <dbl>, PROX_TOP_PRIMARY_SCH <dbl>,
#   PROX_SHOPPING_MALL <dbl>, PROX_SUPERMARKET <dbl>, PROX_BUS_STOP <dbl>,
#   NO_Of_UNITS <dbl>, FAMILY_FRIENDLY <dbl>, FREEHOLD <dbl>,
#   LEASEHOLD_99YR <dbl>, geometry <POINT [m]>

Notice that the output is a point feature data frame: each record was built from a single longitude/latitude pair, so its geometry is a POINT.

7.3 Exploratory Data Analysis

7.3.1 EDA using statistical graphics

We can examine the distribution of SELLING_PRICE with a histogram, as shown in the code chunk below.

ggplot(data=condo_resale.sf, aes(x=`SELLING_PRICE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

The figure above reveals a right-skewed distribution. This means that more condominium units were transacted at relatively lower prices.

Statistically, the skewed distribution can be normalised by using a log transformation. The code chunk below derives a new variable called LOG_SELLING_PRICE by applying a log transformation to the variable SELLING_PRICE. It is performed using mutate() of the dplyr package.

condo_resale.sf <- condo_resale.sf %>%
  mutate(`LOG_SELLING_PRICE` = log(SELLING_PRICE))

Now, you can plot the LOG_SELLING_PRICE using the code chunk below.

ggplot(data=condo_resale.sf, aes(x=`LOG_SELLING_PRICE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

Notice that the distribution is relatively less skewed after the transformation.
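The effect of the log transformation can also be quantified with a skewness coefficient rather than judged by eye. The sketch below uses simulated log-normal values, not the condo_resale data, and the skewness() helper is a hypothetical convenience function written for this illustration.

```r
# Sample skewness: third central moment divided by the cubed
# (population) standard deviation.
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(42)
# Simulated, right-skewed prices (log-normal); NOT the condo_resale data.
simulated_price <- rlnorm(1000, meanlog = 14, sdlog = 0.6)

skew_raw <- skewness(simulated_price)
skew_log <- skewness(log(simulated_price))
c(raw = skew_raw, log = skew_log)
```

A value near zero after the transformation indicates a roughly symmetric distribution; the same helper could be applied to SELLING_PRICE and LOG_SELLING_PRICE.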

7.3.2 Multiple Histogram Plots distribution of variables

We now plot multiple histograms (also known as trellis plot) by using the ggarrange() function of the ggpubr package.

The code chunk below creates 12 histograms. Then, ggarrange() is used to organise these histograms into a small multiple plot of 3 columns by 4 rows.

AREA_SQM <- ggplot(data=condo_resale.sf, aes(x= `AREA_SQM`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

AGE <- ggplot(data=condo_resale.sf, aes(x= `AGE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_CBD <- ggplot(data=condo_resale.sf, aes(x= `PROX_CBD`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_CHILDCARE <- ggplot(data=condo_resale.sf, aes(x= `PROX_CHILDCARE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_ELDERLYCARE <- ggplot(data=condo_resale.sf, aes(x= `PROX_ELDERLYCARE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_URA_GROWTH_AREA <- ggplot(data=condo_resale.sf,
                               aes(x= `PROX_URA_GROWTH_AREA`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_HAWKER_MARKET <- ggplot(data=condo_resale.sf, aes(x= `PROX_HAWKER_MARKET`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_KINDERGARTEN <- ggplot(data=condo_resale.sf, aes(x= `PROX_KINDERGARTEN`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_MRT <- ggplot(data=condo_resale.sf, aes(x= `PROX_MRT`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_PARK <- ggplot(data=condo_resale.sf, aes(x= `PROX_PARK`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_PRIMARY_SCH <- ggplot(data=condo_resale.sf, aes(x= `PROX_PRIMARY_SCH`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

PROX_TOP_PRIMARY_SCH <- ggplot(data=condo_resale.sf,
                               aes(x= `PROX_TOP_PRIMARY_SCH`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

ggarrange(AREA_SQM, AGE, PROX_CBD, PROX_CHILDCARE, PROX_ELDERLYCARE,
          PROX_URA_GROWTH_AREA, PROX_HAWKER_MARKET, PROX_KINDERGARTEN, PROX_MRT,
          PROX_PARK, PROX_PRIMARY_SCH, PROX_TOP_PRIMARY_SCH,
          ncol = 3, nrow = 4)
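Because the twelve histograms differ only in the variable plotted, the same figure could probably be produced with a small helper and lapply(). The sketch below is an alternative under stated assumptions, not the method used in the exercise: plot_hist() is a hypothetical helper, toy_data is a simulated stand-in for condo_resale.sf that only borrows its column names, and the final arrangement step runs only if ggpubr is installed.

```r
library(ggplot2)

hist_vars <- c("AREA_SQM", "AGE", "PROX_CBD", "PROX_CHILDCARE",
               "PROX_ELDERLYCARE", "PROX_URA_GROWTH_AREA",
               "PROX_HAWKER_MARKET", "PROX_KINDERGARTEN", "PROX_MRT",
               "PROX_PARK", "PROX_PRIMARY_SCH", "PROX_TOP_PRIMARY_SCH")

set.seed(7)
# Simulated stand-in for condo_resale.sf; only the column names follow the exercise.
toy_data <- as.data.frame(
  setNames(replicate(length(hist_vars), rexp(200), simplify = FALSE), hist_vars)
)

# Hypothetical helper: one histogram for the column named in `var`.
plot_hist <- function(data, var) {
  ggplot(data, aes(x = .data[[var]])) +
    geom_histogram(bins = 20, color = "black", fill = "light blue") +
    labs(x = var)
}

# Build all twelve plots in one pass and keep them in a named list.
hist_plots <- lapply(hist_vars, function(v) plot_hist(toy_data, v))
names(hist_plots) <- hist_vars

# ggarrange() accepts a list of plots through its plotlist argument.
if (requireNamespace("ggpubr", quietly = TRUE)) {
  ggpubr::ggarrange(plotlist = hist_plots, ncol = 3, nrow = 4)
}
```

Adding or removing a variable then only requires editing hist_vars, at the cost of slightly less explicit code than the one-object-per-plot version.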

7.3.3 Drawing a statistical point map

Lastly, we want to reveal the geospatial distribution of condominium resale prices in Singapore. The map will be prepared by using the tmap package.

tmap_mode("view") 
tm_shape(mpsz_svy21)+   
  tm_polygons() + 
  tm_shape(condo_resale.sf) +     
  tm_dots(col = "SELLING_PRICE", alpha = 0.6,style="quantile")+   
  tm_view(set.zoom.limits = c(11,14))+   
  tmap_options(check.and.fix = TRUE)

We change tmap_mode back to plot before proceeding.

tmap_mode('plot')

7.4 Hedonic Pricing Modelling

Hedonic pricing modeling is an econometric technique used to estimate the value of a good or service by breaking down the price into its component attributes. Commonly applied in real estate, it involves analyzing how individual factors such as location, size, amenities, or proximity to schools influence the overall market price of a property. This model helps in understanding how much each characteristic contributes to the price, separating the effect of specific features from the overall value.
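As a sketch of the idea in equation form, a linear hedonic model can be written as below. The notation is an assumption introduced here for illustration: P_i is the price of property i, x_{ik} is its k-th attribute (for example floor area or distance to the CBD), beta_k is the implicit price of that attribute, and epsilon_i is the error term.

```latex
P_i = \beta_0 + \sum_{k=1}^{K} \beta_k \, x_{ik} + \varepsilon_i
```

Each estimated beta_k is then read as the change in price associated with a one-unit change in attribute k, holding the other attributes fixed.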

We implement the lm() function of base R to build hedonic pricing models for condominium resale units.

7.4.1 Simple Linear Regression Method

First, we will build a simple linear regression model by using SELLING_PRICE as the dependent variable and AREA_SQM as the independent variable.

condo.slr <- lm(formula=SELLING_PRICE ~ AREA_SQM, data = condo_resale.sf)

lm() returns an object of class “lm” or, for multiple responses, of class c(“mlm”, “lm”).

The functions summary() and anova() can be used to obtain and print a summary and analysis of variance table of the results. The generic accessor functions coefficients, effects, fitted.values and residuals extract various useful features of the value returned by lm.
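The accessor functions can be tried on a very small model. The sketch below fits a toy regression to the first six records shown in the head() output earlier, so its coefficients are not comparable to those of condo.slr; the object names condo_head and toy.slr are illustrative.

```r
# First six records of SELLING_PRICE and AREA_SQM, copied from the output above.
condo_head <- data.frame(
  SELLING_PRICE = c(3000000, 3880000, 3325000, 4250000, 1400000, 1320000),
  AREA_SQM      = c(309, 290, 248, 127, 145, 139)
)

toy.slr <- lm(SELLING_PRICE ~ AREA_SQM, data = condo_head)

class(toy.slr)           # "lm"
coefficients(toy.slr)    # intercept and slope
fitted.values(toy.slr)   # predicted price for each record
residuals(toy.slr)       # observed minus fitted
anova(toy.slr)           # analysis of variance table

# Fitted values plus residuals reproduce the observed prices.
reconstructed <- unname(fitted.values(toy.slr) + residuals(toy.slr))
all.equal(reconstructed, condo_head$SELLING_PRICE)
```

The last line is a quick consistency check that holds for any lm fit.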

summary(condo.slr)

Call:
lm(formula = SELLING_PRICE ~ AREA_SQM, data = condo_resale.sf)

Residuals:
     Min       1Q   Median       3Q      Max 
-3695815  -391764   -87517   258900 13503875 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -258121.1    63517.2  -4.064 5.09e-05 ***
AREA_SQM      14719.0      428.1  34.381  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 942700 on 1434 degrees of freedom
Multiple R-squared:  0.4518,    Adjusted R-squared:  0.4515 
F-statistic:  1182 on 1 and 1434 DF,  p-value: < 2.2e-16

The R-squared value of 0.4518 indicates that the simple regression model explains approximately 45% of the variation in resale prices.

Because the p-value of the F-statistic is far smaller than 0.001, we can reject the null hypothesis that the mean alone predicts SELLING_PRICE as well as the regression model does. The simple linear regression model is therefore a significantly better estimator of SELLING_PRICE than the mean.

The Coefficients section of the report shows that the p-values of both the Intercept and AREA_SQM estimates are smaller than 0.001. We can therefore reject the null hypotheses that B0 (the intercept) and B1 (the slope of AREA_SQM) are equal to zero, and treat both as statistically significant parameter estimates.
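The test statistics in the report can be recomputed from the other printed numbers, which is a useful way to see how they relate. The sketch below copies the estimates, standard errors, R-squared and degrees of freedom from the summary(condo.slr) output; small differences from the printed values are expected because the inputs are rounded.

```r
# Values copied from the summary(condo.slr) report above.
est <- c("(Intercept)" = -258121.1, AREA_SQM = 14719.0)
se  <- c("(Intercept)" = 63517.2,   AREA_SQM = 428.1)

# t value = estimate / standard error.
t_value <- est / se

# F statistic from R-squared and the degrees of freedom.
r2       <- 0.4518
df_model <- 1
df_resid <- 1434
f_value  <- (r2 / df_model) / ((1 - r2) / df_resid)

# Upper-tail probability of the F distribution gives the model p-value.
p_value <- pf(f_value, df_model, df_resid, lower.tail = FALSE)

round(t_value, 3)
round(f_value, 1)
```

The recomputed t values and F statistic agree with the report (about -4.064, 34.38 and 1182) up to rounding.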

To visualize the best fit line on a scatterplot, we can pass method = lm to ggplot2's geom_smooth(), as demonstrated in the following code chunk.

ggplot(data=condo_resale.sf,
       aes(x=`AREA_SQM`, y=`SELLING_PRICE`)) +
  geom_point() +
  geom_smooth(method = lm)

The figure above reveals that there are indeed a few statistical outliers with relatively high selling prices.

7.4.2 Multiple Linear Regression Method

7.4.2.1 Visualising the relationships of the independent variables

Before building a multiple regression model, it is important to ensure that the independent variables used are not highly correlated with each other. If highly correlated independent variables are used in a regression model by mistake, the quality of the model will be compromised. This phenomenon is known as multicollinearity in statistics.
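One common way to quantify multicollinearity is the variance inflation factor (VIF), computed for each predictor as 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below uses simulated predictors (x1, x2, x3 are hypothetical, with x2 built to be almost the negative of x1) purely to show the calculation in base R; it does not use the condo_resale data.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- -x1 + rnorm(n, sd = 0.2)  # nearly a mirror image of x1
x3 <- rnorm(n)                  # unrelated to the other two
predictors <- data.frame(x1, x2, x3)

# VIF of each predictor: regress it on the remaining predictors.
vif <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})

round(vif, 2)
```

The two collinear predictors receive large VIF values while the independent one stays close to 1. For a fitted lm model, the olsrr package loaded earlier provides ols_vif_tol(), which reports the same quantity.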

A correlation matrix is commonly used to visualise the relationships between the independent variables. Besides the pairs() function of base R, many packages support the display of a correlation matrix. In this section, the corrplot package will be used.

The code chunk below plots a correlation matrix of the independent variables in the condo_resale data frame.

corrplot(cor(condo_resale[, 5:23]), 
         diag = FALSE, order = "AOE",          
         tl.pos = "td", tl.cex = 0.5, 
         method = "number", type = "upper")

Reordering the matrix is very important for mining the hidden structure and patterns in it. Besides the default “original” order, corrplot offers four methods (parameter order): “AOE”, “FPC”, “hclust” and “alphabet”. In the code chunk above, the AOE order is used. It orders the variables by using the angular order of the eigenvectors method suggested by Michael Friendly.

From the correlation matrix, it is clear that FREEHOLD is highly correlated with LEASEHOLD_99YR. In view of this, it is wiser to include only one of them in the subsequent model building, and LEASEHOLD_99YR is the one that would normally be excluded.

ggcorrmat(condo_resale[, 5:23])

7.4.3 Building a hedonic pricing model using multiple linear regression method

The code chunk below uses lm() to calibrate the multiple linear regression model.

Note that, for this in-class exercise, we have added the variable LEASEHOLD_99YR to the model below.

condo.mlr <- lm(formula = SELLING_PRICE ~ AREA_SQM + AGE+
                  PROX_CBD + PROX_CHILDCARE + PROX_ELDERLYCARE + 
                  PROX_URA_GROWTH_AREA + PROX_HAWKER_MARKET + PROX_KINDERGARTEN + 
                  PROX_MRT  + PROX_PARK + PROX_PRIMARY_SCH +  
                  PROX_TOP_PRIMARY_SCH + PROX_SHOPPING_MALL + PROX_SUPERMARKET +  
                  PROX_BUS_STOP + NO_Of_UNITS + FAMILY_FRIENDLY + FREEHOLD+ LEASEHOLD_99YR,
                data=condo_resale.sf) 
summary(condo.mlr)

Call:
lm(formula = SELLING_PRICE ~ AREA_SQM + AGE + PROX_CBD + PROX_CHILDCARE + 
    PROX_ELDERLYCARE + PROX_URA_GROWTH_AREA + PROX_HAWKER_MARKET + 
    PROX_KINDERGARTEN + PROX_MRT + PROX_PARK + PROX_PRIMARY_SCH + 
    PROX_TOP_PRIMARY_SCH + PROX_SHOPPING_MALL + PROX_SUPERMARKET + 
    PROX_BUS_STOP + NO_Of_UNITS + FAMILY_FRIENDLY + FREEHOLD + 
    LEASEHOLD_99YR, data = condo_resale.sf)

Residuals:
     Min       1Q   Median       3Q      Max 
-3471036  -286903   -22426   239412 12254549 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           543071.4   136210.9   3.987 7.03e-05 ***
AREA_SQM               12688.7      370.1  34.283  < 2e-16 ***
AGE                   -24566.0     2766.0  -8.881  < 2e-16 ***
PROX_CBD              -78122.0     6791.4 -11.503  < 2e-16 ***
PROX_CHILDCARE       -333219.0   111020.3  -3.001 0.002734 ** 
PROX_ELDERLYCARE      170950.0    42110.8   4.060 5.19e-05 ***
PROX_URA_GROWTH_AREA   38507.6    12523.7   3.075 0.002147 ** 
PROX_HAWKER_MARKET     23801.2    29299.9   0.812 0.416739    
PROX_KINDERGARTEN     144098.0    82738.7   1.742 0.081795 .  
PROX_MRT             -322775.9    58528.1  -5.515 4.14e-08 ***
PROX_PARK             564487.9    66563.0   8.481  < 2e-16 ***
PROX_PRIMARY_SCH      186170.5    65515.2   2.842 0.004553 ** 
PROX_TOP_PRIMARY_SCH    -477.1    20598.0  -0.023 0.981525    
PROX_SHOPPING_MALL   -207721.5    42855.5  -4.847 1.39e-06 ***
PROX_SUPERMARKET      -48074.7    77145.3  -0.623 0.533273    
PROX_BUS_STOP         675755.0   138552.0   4.877 1.20e-06 ***
NO_Of_UNITS             -216.2       90.3  -2.394 0.016797 *  
FAMILY_FRIENDLY       142128.3    47055.1   3.020 0.002569 ** 
FREEHOLD              300646.5    77296.5   3.890 0.000105 ***
LEASEHOLD_99YR        -77137.4    77570.9  -0.994 0.320192    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 755800 on 1416 degrees of freedom
Multiple R-squared:  0.652, Adjusted R-squared:  0.6474 
F-statistic: 139.6 on 19 and 1416 DF,  p-value: < 2.2e-16

7.4.4 Preparing Publication Quality Table: olsrr method

With reference to the report above, it is clear that not all the independent variables are statistically significant. We will revise the model by removing the variables that are not statistically significant, except LEASEHOLD_99YR, which is retained for this in-class exercise even though it is not significant.
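The selection can be sketched programmatically. The p-values below are copied from the summary(condo.mlr) report, with entries printed as "< 2e-16" entered as 2e-16 (an upper bound); the 0.05 threshold and the object names are assumptions made for this illustration.

```r
# p-values copied from the summary(condo.mlr) report above;
# "< 2e-16" entries are entered as 2e-16 (an upper bound).
p_values <- c(
  AREA_SQM = 2e-16, AGE = 2e-16, PROX_CBD = 2e-16,
  PROX_CHILDCARE = 0.002734, PROX_ELDERLYCARE = 5.19e-05,
  PROX_URA_GROWTH_AREA = 0.002147, PROX_HAWKER_MARKET = 0.416739,
  PROX_KINDERGARTEN = 0.081795, PROX_MRT = 4.14e-08, PROX_PARK = 2e-16,
  PROX_PRIMARY_SCH = 0.004553, PROX_TOP_PRIMARY_SCH = 0.981525,
  PROX_SHOPPING_MALL = 1.39e-06, PROX_SUPERMARKET = 0.533273,
  PROX_BUS_STOP = 1.20e-06, NO_Of_UNITS = 0.016797,
  FAMILY_FRIENDLY = 0.002569, FREEHOLD = 0.000105,
  LEASEHOLD_99YR = 0.320192
)

alpha   <- 0.05                                  # assumed significance level
keep    <- names(p_values)[p_values < alpha]     # terms to retain
dropped <- setdiff(names(p_values), keep)        # terms to remove

# Rebuild a model formula from the retained terms.
revised_formula <- reformulate(keep, response = "SELLING_PRICE")

dropped
revised_formula
```

With a fitted model at hand, the same vector can be read from summary(condo.mlr)$coefficients[, "Pr(>|t|)"] (excluding the intercept row). A strict 0.05 filter keeps 14 terms and would also drop LEASEHOLD_99YR, so it differs from the revised model by that one term.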

Now, we are ready to calibrate the revised model by using the code chunk below.

condo.mlr1 <- lm(formula = SELLING_PRICE ~ AREA_SQM + AGE + 
                   PROX_CBD + PROX_CHILDCARE + PROX_ELDERLYCARE +
                   PROX_URA_GROWTH_AREA + PROX_MRT  + PROX_PARK + 
                   PROX_PRIMARY_SCH + PROX_SHOPPING_MALL    + PROX_BUS_STOP + 
                   NO_Of_UNITS + FAMILY_FRIENDLY + FREEHOLD+ LEASEHOLD_99YR,
                 data=condo_resale.sf)
ols_regress(condo.mlr1)
                                Model Summary                                 
-----------------------------------------------------------------------------
R                            0.807       RMSE                     751634.509 
R-Squared                    0.651       MSE                571320119490.610 
Adj. R-Squared               0.647       Coef. Var                    43.162 
Pred R-Squared               0.637       AIC                       42967.367 
MAE                     413020.461       SBC                       43056.951 
-----------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                     ANOVA                                       
--------------------------------------------------------------------------------
                    Sum of                                                      
                   Squares          DF         Mean Square       F         Sig. 
--------------------------------------------------------------------------------
Regression    1.513372e+15          15        1.008915e+14    176.594    0.0000 
Residual      8.112746e+14        1420    571320119490.610                      
Total         2.324647e+15        1435                                          
--------------------------------------------------------------------------------

                                               Parameter Estimates                                                
-----------------------------------------------------------------------------------------------------------------
               model           Beta    Std. Error    Std. Beta       t        Sig           lower          upper 
-----------------------------------------------------------------------------------------------------------------
         (Intercept)     591539.643    121110.937                   4.884    0.000     353964.068     829115.218 
            AREA_SQM      12754.325       367.962        0.582     34.662    0.000      12032.517      13476.133 
                 AGE     -24822.087      2756.860       -0.168     -9.004    0.000     -30230.043     -19414.132 
            PROX_CBD     -76833.361      5767.956       -0.262    -13.321    0.000     -88147.991     -65518.730 
      PROX_CHILDCARE    -297608.214    109400.497       -0.078     -2.720    0.007    -512212.168     -83004.261 
    PROX_ELDERLYCARE     183303.549     39943.561        0.089      4.589    0.000     104948.823     261658.276 
PROX_URA_GROWTH_AREA      39752.039     11763.983        0.061      3.379    0.001      16675.385      62828.692 
            PROX_MRT    -305114.878     57591.189       -0.116     -5.298    0.000    -418087.828    -192141.927 
           PROX_PARK     572038.799     65511.407        0.150      8.732    0.000     443529.265     700548.334 
    PROX_PRIMARY_SCH     164542.899     60358.977        0.064      2.726    0.006      46140.557     282945.241 
  PROX_SHOPPING_MALL    -220515.279     36558.846       -0.115     -6.032    0.000    -292230.427    -148800.131 
       PROX_BUS_STOP     674997.951    134646.651        0.133      5.013    0.000     410870.234     939125.668 
         NO_Of_UNITS       -228.616        89.102       -0.049     -2.566    0.010       -403.402        -53.830 
     FAMILY_FRIENDLY     148152.863     46913.189        0.058      3.158    0.002      56126.263     240179.463 
            FREEHOLD     281136.713     76537.974        0.109      3.673    0.000     130997.067     431276.358 
      LEASEHOLD_99YR     -89655.454     76421.659       -0.035     -1.173    0.241    -239566.931      60256.022 
-----------------------------------------------------------------------------------------------------------------
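The headline figures of the Model Summary can be traced back to the ANOVA table. The sketch below copies the sums of squares and degrees of freedom from the ols_regress(condo.mlr1) report and recomputes R-squared, adjusted R-squared and the F statistic; the variable names are illustrative and small differences are due to rounding in the printed table.

```r
# Values copied from the ANOVA table of ols_regress(condo.mlr1) above.
ss_regression <- 1.513372e15
ss_total      <- 2.324647e15
df_regression <- 15
df_residual   <- 1420
n <- df_regression + df_residual + 1   # 1436 observations

# Share of the total variation explained by the model.
r2 <- ss_regression / ss_total

# Adjusted R-squared penalises the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_residual

# F = mean square of regression / mean square of residuals.
ss_residual <- ss_total - ss_regression
f_stat <- (ss_regression / df_regression) / (ss_residual / df_residual)

round(c(r2 = r2, adj_r2 = adj_r2, f_stat = f_stat), 3)
```

The results match the reported R-Squared (0.651), Adj. R-Squared (0.647) and F (176.594) up to rounding.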

We also use the olsrr package to generate a report for the full model created earlier (condo.mlr), in the same way as in the code chunk above.

ols_regress(condo.mlr)
                                Model Summary                                 
-----------------------------------------------------------------------------
R                            0.807       RMSE                     750537.537 
R-Squared                    0.652       MSE                571262902261.223 
Adj. R-Squared               0.647       Coef. Var                    43.160 
Pred R-Squared               0.637       AIC                       42971.173 
MAE                     412117.987       SBC                       43081.835 
-----------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                     ANOVA                                       
--------------------------------------------------------------------------------
                    Sum of                                                      
                   Squares          DF         Mean Square       F         Sig. 
--------------------------------------------------------------------------------
Regression    1.515738e+15          19        7.977571e+13    139.648    0.0000 
Residual      8.089083e+14        1416    571262902261.223                      
Total         2.324647e+15        1435                                          
--------------------------------------------------------------------------------

                                               Parameter Estimates                                                
-----------------------------------------------------------------------------------------------------------------
               model           Beta    Std. Error    Std. Beta       t        Sig           lower          upper 
-----------------------------------------------------------------------------------------------------------------
         (Intercept)     543071.420    136210.918                   3.987    0.000     275874.535     810268.305 
            AREA_SQM      12688.669       370.119        0.579     34.283    0.000      11962.627      13414.710 
                 AGE     -24566.001      2766.041       -0.166     -8.881    0.000     -29991.980     -19140.022 
            PROX_CBD     -78121.985      6791.377       -0.267    -11.503    0.000     -91444.227     -64799.744 
      PROX_CHILDCARE    -333219.036    111020.303       -0.087     -3.001    0.003    -551000.984    -115437.089 
    PROX_ELDERLYCARE     170949.961     42110.748        0.083      4.060    0.000      88343.803     253556.120 
PROX_URA_GROWTH_AREA      38507.622     12523.661        0.059      3.075    0.002      13940.700      63074.545 
  PROX_HAWKER_MARKET      23801.197     29299.923        0.019      0.812    0.417     -33674.725      81277.120 
   PROX_KINDERGARTEN     144097.972     82738.669        0.030      1.742    0.082     -18205.570     306401.514 
            PROX_MRT    -322775.874     58528.079       -0.123     -5.515    0.000    -437586.937    -207964.811 
           PROX_PARK     564487.876     66563.011        0.148      8.481    0.000     433915.162     695060.590 
    PROX_PRIMARY_SCH     186170.524     65515.193        0.072      2.842    0.005      57653.253     314687.795 
PROX_TOP_PRIMARY_SCH       -477.073     20597.972       -0.001     -0.023    0.982     -40882.894      39928.747 
  PROX_SHOPPING_MALL    -207721.520     42855.500       -0.109     -4.847    0.000    -291788.613    -123654.427 
    PROX_SUPERMARKET     -48074.679     77145.257       -0.012     -0.623    0.533    -199405.956     103256.599 
       PROX_BUS_STOP     675755.044    138551.991        0.133      4.877    0.000     403965.817     947544.272 
         NO_Of_UNITS       -216.180        90.302       -0.046     -2.394    0.017       -393.320        -39.040 
     FAMILY_FRIENDLY     142128.272     47055.082        0.056      3.020    0.003      49823.107     234433.438 
            FREEHOLD     300646.543     77296.529        0.117      3.890    0.000     149018.525     452274.561 
      LEASEHOLD_99YR     -77137.375     77570.869       -0.030     -0.994    0.320    -229303.551      75028.801 
-----------------------------------------------------------------------------------------------------------------
  • The report includes a legend for RMSE, MSE, MAE, AIC and SBC, which makes the output easier to interpret.

  • It gives more detailed output than the basic lm() summary.

  • Adjusted R-squared indicates how much of the variation in price the model accounts for, after penalising for the number of predictors. Here it is 0.647, i.e. about 65% of the variation.
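
As a quick check on the report above, the fit statistics can be reproduced by hand from the ANOVA sums of squares. The sketch below uses base R only; the variable names are illustrative, and the inputs are copied from the printed output, so the results agree only up to rounding.

```r
# Values copied from the ANOVA table above (rounded as printed)
ss_residual <- 8.089083e14
ss_total    <- 2.324647e15
n           <- 1436   # Total DF (1435) + 1
p           <- 19     # Regression DF = number of predictors

# R-squared: share of total variation not left in the residuals
r2 <- 1 - ss_residual / ss_total

# Adjusted R-squared: penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# F statistic: mean square of the regression over mean square of the residuals
f_stat <- ((ss_total - ss_residual) / p) / (ss_residual / (n - p - 1))

round(c(r2 = r2, adj_r2 = adj_r2, f = f_stat), 3)
```

The three values should line up with R-Squared, Adj. R-Squared and F in the report.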

Multicollinearity check using VIF

We can check for multicollinearity with the variance inflation factor (VIF) using the code chunk below.

ols_vif_tol(condo.mlr)
              Variables Tolerance      VIF
1              AREA_SQM 0.8601326 1.162611
2                   AGE 0.7011585 1.426211
3              PROX_CBD 0.4575471 2.185567
4        PROX_CHILDCARE 0.2898233 3.450378
5      PROX_ELDERLYCARE 0.5922238 1.688551
6  PROX_URA_GROWTH_AREA 0.6614081 1.511926
7    PROX_HAWKER_MARKET 0.4373874 2.286303
8     PROX_KINDERGARTEN 0.8356793 1.196631
9              PROX_MRT 0.4949877 2.020252
10            PROX_PARK 0.8015728 1.247547
11     PROX_PRIMARY_SCH 0.3823248 2.615577
12 PROX_TOP_PRIMARY_SCH 0.4878620 2.049760
13   PROX_SHOPPING_MALL 0.4903052 2.039546
14     PROX_SUPERMARKET 0.6142127 1.628100
15        PROX_BUS_STOP 0.3311024 3.020213
16          NO_Of_UNITS 0.6543336 1.528272
17      FAMILY_FRIENDLY 0.7191719 1.390488
18             FREEHOLD 0.2728521 3.664990
19       LEASEHOLD_99YR 0.2645988 3.779307

A VIF above 5 suggests multicollinearity. A variable with a VIF above 10 should be removed from the model.

In our case, all VIF values are below 5, so no variable needs to be removed.

Highly correlated continuous variables affect the model more than correlated dummy variables do.
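
To make the two columns of the table concrete: tolerance is 1 - R² from regressing one predictor on all the others, and VIF is its reciprocal. The following is a minimal base R sketch, using the built-in mtcars data as a stand-in for the condo data and a hypothetical helper named vif_manual(); it is not how olsrr is implemented internally.

```r
# Hypothetical helper: VIF of each predictor, computed from first principles
vif_manual <- function(data, predictors) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    # Regress predictor v on the remaining predictors
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)   # tolerance is 1 - r2, VIF is its reciprocal
  })
}

vifs <- vif_manual(mtcars, c("wt", "hp", "disp"))
tolerance <- 1 / vifs
round(cbind(Tolerance = tolerance, VIF = vifs), 3)

# Same relationship in the table above: AREA_SQM has tolerance 0.8601326
area_vif <- 1 / 0.8601326   # approximately 1.162611
```

A VIF of 1 means the predictor is uncorrelated with the others; the value grows as the other predictors explain more of it.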

Variable Selection using Stepwise Regression

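Stepwise regression can proceed in two directions:

  1. Forward: Start with no variables and add them one at a time, at each step adding the candidate with the smallest p-value, until no remaining candidate meets the threshold (here p_val = 0.05).

  2. Backward: Start with all variables and remove them one at a time, at each step dropping the least significant variable, until every remaining variable meets the threshold.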

condo_fw_mlr=ols_step_forward_p(condo.mlr,
                                 p_val = 0.05,
                                 details = FALSE)
condo_fw_mlr

                                     Stepwise Summary                                      
-----------------------------------------------------------------------------------------
Step    Variable                   AIC          SBC         SBIC         R2       Adj. R2 
-----------------------------------------------------------------------------------------
 0      Base Model              44449.068    44459.608    40371.745    0.00000    0.00000 
 1      AREA_SQM                43587.753    43603.562    39510.883    0.45184    0.45146 
 2      PROX_CBD                43243.523    43264.602    39167.182    0.56928    0.56868 
 3      PROX_PARK               43177.691    43204.039    39101.331    0.58915    0.58829 
 4      FREEHOLD                43125.474    43157.092    39049.179    0.60438    0.60327 
 5      AGE                     43069.222    43106.109    38993.167    0.62010    0.61878 
 6      PROX_ELDERLYCARE        43046.515    43088.672    38970.548    0.62659    0.62502 
 7      PROX_SHOPPING_MALL      43020.990    43068.417    38945.209    0.63367    0.63188 
 8      PROX_URA_GROWTH_AREA    43009.092    43061.788    38933.407    0.63720    0.63517 
 9      PROX_MRT                42999.058    43057.024    38923.483    0.64023    0.63796 
 10     PROX_BUS_STOP           42984.951    43048.186    38909.582    0.64424    0.64175 
 11     FAMILY_FRIENDLY         42981.085    43049.590    38905.797    0.64569    0.64296 
 12     NO_Of_UNITS             42975.246    43049.021    38900.092    0.64762    0.64465 
 13     PROX_CHILDCARE          42971.858    43050.902    38896.812    0.64894    0.64573 
 14     PROX_PRIMARY_SCH        42966.758    43051.072    38891.872    0.65067    0.64723 
-----------------------------------------------------------------------------------------

Final Model Output 
------------------

                                Model Summary                                 
-----------------------------------------------------------------------------
R                            0.807       RMSE                     751998.679 
R-Squared                    0.651       MSE                571471422208.591 
Adj. R-Squared               0.647       Coef. Var                    43.168 
Pred R-Squared               0.638       AIC                       42966.758 
MAE                     414819.628       SBC                       43051.072 
-----------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                     ANOVA                                       
--------------------------------------------------------------------------------
                    Sum of                                                      
                   Squares          DF         Mean Square       F         Sig. 
--------------------------------------------------------------------------------
Regression    1.512586e+15          14        1.080418e+14    189.059    0.0000 
Residual      8.120609e+14        1421    571471422208.591                      
Total         2.324647e+15        1435                                          
--------------------------------------------------------------------------------

                                               Parameter Estimates                                                
-----------------------------------------------------------------------------------------------------------------
               model           Beta    Std. Error    Std. Beta       t        Sig           lower          upper 
-----------------------------------------------------------------------------------------------------------------
         (Intercept)     527633.222    108183.223                   4.877    0.000     315417.244     739849.200 
            AREA_SQM      12777.523       367.479        0.584     34.771    0.000      12056.663      13498.382 
            PROX_CBD     -77131.323      5763.125       -0.263    -13.384    0.000     -88436.469     -65826.176 
           PROX_PARK     570504.807     65507.029        0.150      8.709    0.000     442003.938     699005.677 
            FREEHOLD     350599.812     48506.485        0.136      7.228    0.000     255447.802     445751.821 
                 AGE     -24687.739      2754.845       -0.167     -8.962    0.000     -30091.739     -19283.740 
    PROX_ELDERLYCARE     185575.623     39901.864        0.090      4.651    0.000     107302.737     263848.510 
  PROX_SHOPPING_MALL    -220947.251     36561.832       -0.115     -6.043    0.000    -292668.213    -149226.288 
PROX_URA_GROWTH_AREA      39163.254     11754.829        0.060      3.332    0.001      16104.571      62221.936 
            PROX_MRT    -294745.107     56916.367       -0.112     -5.179    0.000    -406394.234    -183095.980 
       PROX_BUS_STOP     682482.221    134513.243        0.134      5.074    0.000     418616.359     946348.082 
     FAMILY_FRIENDLY     146307.576     46893.021        0.057      3.120    0.002      54320.593     238294.560 
         NO_Of_UNITS       -245.480        87.947       -0.053     -2.791    0.005       -418.000        -72.961 
      PROX_CHILDCARE    -318472.751    107959.512       -0.084     -2.950    0.003    -530249.889    -106695.613 
    PROX_PRIMARY_SCH     159856.136     60234.599        0.062      2.654    0.008      41697.849     278014.424 
-----------------------------------------------------------------------------------------------------------------
plot(condo_fw_mlr)

condo_bw_mlr=ols_step_backward_p(condo.mlr,
                    p_val = 0.05,
                    details = FALSE)
condo_bw_mlr

                                     Stepwise Summary                                      
-----------------------------------------------------------------------------------------
Step    Variable                   AIC          SBC         SBIC         R2       Adj. R2 
-----------------------------------------------------------------------------------------
 0      Full Model              42971.173    43081.835    38896.546    0.65203    0.64736 
 1      PROX_TOP_PRIMARY_SCH    42969.173    43074.565    38894.518    0.65203    0.64761 
 2      PROX_SUPERMARKET        42967.567    43067.689    38892.873    0.65193    0.64776 
 3      PROX_HAWKER_MARKET      42966.461    43061.315    38891.719    0.65172    0.64779 
 4      LEASEHOLD_99YR          42965.558    43055.141    38890.764    0.65145    0.64777 
 5      PROX_KINDERGARTEN       42966.758    43051.072    38891.872    0.65067    0.64723 
-----------------------------------------------------------------------------------------

Final Model Output 
------------------

                                Model Summary                                 
-----------------------------------------------------------------------------
R                            0.807       RMSE                     751998.679 
R-Squared                    0.651       MSE                571471422208.591 
Adj. R-Squared               0.647       Coef. Var                    43.168 
Pred R-Squared               0.638       AIC                       42966.758 
MAE                     414819.628       SBC                       43051.072 
-----------------------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                     ANOVA                                       
--------------------------------------------------------------------------------
                    Sum of                                                      
                   Squares          DF         Mean Square       F         Sig. 
--------------------------------------------------------------------------------
Regression    1.512586e+15          14        1.080418e+14    189.059    0.0000 
Residual      8.120609e+14        1421    571471422208.591                      
Total         2.324647e+15        1435                                          
--------------------------------------------------------------------------------

                                               Parameter Estimates                                                
-----------------------------------------------------------------------------------------------------------------
               model           Beta    Std. Error    Std. Beta       t        Sig           lower          upper 
-----------------------------------------------------------------------------------------------------------------
         (Intercept)     527633.222    108183.223                   4.877    0.000     315417.244     739849.200 
            AREA_SQM      12777.523       367.479        0.584     34.771    0.000      12056.663      13498.382 
                 AGE     -24687.739      2754.845       -0.167     -8.962    0.000     -30091.739     -19283.740 
            PROX_CBD     -77131.323      5763.125       -0.263    -13.384    0.000     -88436.469     -65826.176 
      PROX_CHILDCARE    -318472.751    107959.512       -0.084     -2.950    0.003    -530249.889    -106695.613 
    PROX_ELDERLYCARE     185575.623     39901.864        0.090      4.651    0.000     107302.737     263848.510 
PROX_URA_GROWTH_AREA      39163.254     11754.829        0.060      3.332    0.001      16104.571      62221.936 
            PROX_MRT    -294745.107     56916.367       -0.112     -5.179    0.000    -406394.234    -183095.980 
           PROX_PARK     570504.807     65507.029        0.150      8.709    0.000     442003.938     699005.677 
    PROX_PRIMARY_SCH     159856.136     60234.599        0.062      2.654    0.008      41697.849     278014.424 
  PROX_SHOPPING_MALL    -220947.251     36561.832       -0.115     -6.043    0.000    -292668.213    -149226.288 
       PROX_BUS_STOP     682482.221    134513.243        0.134      5.074    0.000     418616.359     946348.082 
         NO_Of_UNITS       -245.480        87.947       -0.053     -2.791    0.005       -418.000        -72.961 
     FAMILY_FRIENDLY     146307.576     46893.021        0.057      3.120    0.002      54320.593     238294.560 
            FREEHOLD     350599.812     48506.485        0.136      7.228    0.000     255447.802     445751.821 
-----------------------------------------------------------------------------------------------------------------
plot(condo_bw_mlr)
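
The forward procedure can be sketched in a few lines of base R. This is only an illustration of the idea under stated assumptions: forward_p() is a hypothetical helper, the built-in mtcars data stand in for the condo data, and the entry rule (partial F-test p-value below 0.05) is a simplification that may differ in detail from what olsrr does.

```r
# Hypothetical helper: forward selection driven by p-values
forward_p <- function(data, response, candidates, p_enter = 0.05) {
  fit <- lm(reformulate("1", response = response), data = data)  # intercept only
  selected <- character(0)
  full_scope <- reformulate(candidates)
  while (length(selected) < length(candidates)) {
    # Partial F-test for adding each candidate not yet in the model
    trial <- add1(fit, scope = full_scope, test = "F")
    trial <- trial[-1, , drop = FALSE]            # drop the "<none>" row
    best <- which.min(trial[["Pr(>F)"]])
    if (trial[["Pr(>F)"]][best] >= p_enter) break # nothing left worth adding
    term <- rownames(trial)[best]
    selected <- c(selected, term)
    fit <- update(fit, as.formula(paste(". ~ . +", term)))
  }
  list(model = fit, selected = selected)
}

candidates <- c("wt", "hp", "disp", "qsec", "drat")
res <- forward_p(mtcars, "mpg", candidates)
res$selected   # variables in the order they entered the model
```

A backward version would start from the full model and use drop1() in the same way, removing the term with the largest p-value at each step.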

7.4.5 Preparing Publication Quality Table: gtsummary method

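The gtsummary package is designed for publication-ready regression tables. Here we take a lighter route and use the broom package, which returns model results as a tidy tibble.

In the code chunk below, the tidy() function is used to extract the coefficient table of condo.mlr1, with one row per term, as the basis for a well-formatted regression report.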

broom::tidy(condo.mlr1, intercept = TRUE)
# A tibble: 16 × 5
   term                 estimate std.error statistic   p.value
   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
 1 (Intercept)           591540.  121111.       4.88 1.16e-  6
 2 AREA_SQM               12754.     368.      34.7  2.84e-191
 3 AGE                   -24822.    2757.      -9.00 6.84e- 19
 4 PROX_CBD              -76833.    5768.     -13.3  3.11e- 38
 5 PROX_CHILDCARE       -297608.  109400.      -2.72 6.60e-  3
 6 PROX_ELDERLYCARE      183304.   39944.       4.59 4.85e-  6
 7 PROX_URA_GROWTH_AREA   39752.   11764.       3.38 7.47e-  4
 8 PROX_MRT             -305115.   57591.      -5.30 1.36e-  7
 9 PROX_PARK             572039.   65511.       8.73 6.91e- 18
10 PROX_PRIMARY_SCH      164543.   60359.       2.73 6.49e-  3
11 PROX_SHOPPING_MALL   -220515.   36559.      -6.03 2.06e-  9
12 PROX_BUS_STOP         674998.  134647.       5.01 6.03e-  7
13 NO_Of_UNITS             -229.      89.1     -2.57 1.04e-  2
14 FAMILY_FRIENDLY       148153.   46913.       3.16 1.62e-  3
15 FREEHOLD              281137.   76538.       3.67 2.48e-  4
16 LEASEHOLD_99YR        -89655.   76422.      -1.17 2.41e-  1

7.4.6 Checking for Multicollinearity

In this section, we use olsrr, an R package designed specifically for OLS (Ordinary Least Squares) regression analysis. This package offers a wide range of valuable tools to help you build more robust multiple linear regression models. Its key features include:

  • Comprehensive regression output
  • Diagnostic tests for residuals
  • Influence measures for identifying outliers
  • Tests for heteroskedasticity
  • Collinearity diagnostics to detect multicollinearity
  • Model fit assessment
  • Evaluation of variable contributions
  • Various methods for variable selection

In the code snippet below, we demonstrate how to use the ols_vif_tol() function from the olsrr package to assess potential multicollinearity among predictors in your regression model.

ols_vif_tol(condo.mlr1)
              Variables Tolerance      VIF
1              AREA_SQM 0.8703348 1.148983
2                   AGE 0.7059074 1.416616
3              PROX_CBD 0.6343823 1.576337
4        PROX_CHILDCARE 0.2984991 3.350094
5      PROX_ELDERLYCARE 0.6582967 1.519072
6  PROX_URA_GROWTH_AREA 0.7496642 1.333931
7              PROX_MRT 0.5112747 1.955896
8             PROX_PARK 0.8275963 1.208319
9      PROX_PRIMARY_SCH 0.4504807 2.219851
10   PROX_SHOPPING_MALL 0.6738111 1.484095
11        PROX_BUS_STOP 0.3506229 2.852067
12          NO_Of_UNITS 0.6721417 1.487781
13      FAMILY_FRIENDLY 0.7236014 1.381976
14             FREEHOLD 0.2783151 3.593049
15       LEASEHOLD_99YR 0.2726438 3.667789

Since the VIF values of all independent variables are less than 10, we can safely conclude that there is no sign of multicollinearity among them.

7.4.6.1 Test for Non-Linearity

In multiple linear regression, it is important to test the assumption of linearity and additivity of the relationship between the dependent and independent variables.

In the code chunk below, ols_plot_resid_fit() of the olsrr package is used to test the linearity assumption.

Notice that we use the forward stepwise model created above. The stepwise result is a wrapper object, so we explicitly pass its model component (condo_fw_mlr$model), which holds the fitted model.

ols_plot_resid_fit(condo_fw_mlr$model)

The figure above reveals that most of the data points are scattered around the 0 line, hence we can safely conclude that the relationships between the dependent variable and independent variables are linear.
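
For readers who want to see what such a plot contains, a residuals-versus-fitted plot can be drawn in base R as sketched below. The model here is a stand-in fitted on the built-in mtcars data, not the condo model, and the smoothed line is an optional extra for spotting curvature.

```r
# Stand-in model on built-in data; not the condo model
fit <- lm(mpg ~ wt + hp, data = mtcars)

plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)                                # reference line at zero
lines(lowess(fitted(fit), resid(fit)), col = "red")   # smoothed trend
```

If the linearity assumption holds, the points scatter evenly around the zero line and the smoothed trend stays roughly flat.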

7.4.6.2 Test for Normality Assumption

The code chunk below uses ols_plot_resid_hist() of olsrr package to perform normality assumption test.

ols_plot_resid_hist(condo_fw_mlr$model)

The figure reveals that the residuals of the multiple linear regression model (i.e. condo_fw_mlr$model) appear to resemble a normal distribution.

If you prefer formal statistical tests, ols_test_normality() of the olsrr package can be used as shown in the code chunk below.

ols_test_normality(condo_fw_mlr$model)
-----------------------------------------------
       Test             Statistic       pvalue  
-----------------------------------------------
Shapiro-Wilk              0.6856         0.0000 
Kolmogorov-Smirnov        0.1366         0.0000 
Cramer-von Mises         121.0768        0.0000 
Anderson-Darling         67.9551         0.0000 
-----------------------------------------------

The summary table above reveals that the p-values of the four tests are far smaller than the alpha value of 0.05. Hence we reject the null hypothesis and infer that there is statistical evidence that the residuals are not normally distributed.
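
Two of these tests are also available in base R. The sketch below applies them to simulated, deliberately skewed values standing in for model residuals (an assumption for illustration only), to show how a rejection of normality looks in practice.

```r
set.seed(123)

# Simulated stand-in for residuals: centred but strongly right-skewed
skewed_resid <- rexp(500) - 1

# Shapiro-Wilk test (base R accepts between 3 and 5000 observations)
sw <- shapiro.test(skewed_resid)

# Kolmogorov-Smirnov test against the standard normal, after standardising.
# Estimating the mean and sd from the same data makes this only a rough check.
ks <- ks.test(as.numeric(scale(skewed_resid)), "pnorm")

c(shapiro_p = sw$p.value, ks_p = ks$p.value)
```

Both p-values fall well below 0.05 for data this skewed. With samples as large as the condo data, such tests are sensitive to even modest departures from normality, so the histogram and the tests are best read together.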

7.4.6.3 Testing for Spatial Autocorrelation

The hedonic model we are building uses geographically referenced attributes, hence it is also important for us to visualise the residuals of the hedonic pricing model.

To test for spatial autocorrelation, we first attach the model residuals to the condo_resale.sf sf data frame.

First, we will export the residuals of the hedonic pricing model and save them as a data frame.

mlr.output <- as.data.frame(condo_fw_mlr$model$residuals) %>%
  rename(`FW_MLR_RES` = `condo_fw_mlr$model$residuals`)

Next, we will join the newly created data frame with condo_resale.sf object.

condo_resale.sf <- cbind(condo_resale.sf, 
                        mlr.output$`FW_MLR_RES`)%>%
  rename('MLR_RES'='mlr.output.FW_MLR_RES')

Because the sfdep package works directly with sf objects, condo_resale.sf does not need to be converted into a SpatialPointsDataFrame.

Next, we will use tmap package to display the distribution of the residuals on an interactive map.

tmap_mode("view")
tm_shape(mpsz_svy21)+
  tmap_options(check.and.fix = TRUE) + # Repair invalid geometries in this layer, which is known to cause issues
  tm_polygons(alpha = 0.4) +
tm_shape(condo_resale.sf) +  
  tm_dots(col = "MLR_RES",
          alpha = 0.6,
          style="quantile") +
  tm_view(set.zoom.limits = c(11,14))
tmap_mode('plot')

Residuals show the difference between the actual transaction price and the price estimated by the model.

To test formally whether the residuals are spatially autocorrelated, the Moran’s I test will be performed.

We first derive a k-nearest-neighbour list (k = 6) with st_knn() and row-standardised spatial weights with st_weights(), both from the sfdep package.

condo_resale.sf= condo_resale.sf%>%
  mutate(nb=st_knn(geometry, k=6, 
                   longlat=FALSE),
         wt= st_weights(nb, 
                        style='W'),
         .before=1)
global_moran_perm(condo_resale.sf$MLR_RES, 
                  condo_resale.sf$nb,
                  condo_resale.sf$wt,
                  alternative='two.sided',
                  nsim=99)

    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 100 

statistic = 0.32254, observed rank = 100, p-value < 2.2e-16
alternative hypothesis: two.sided

The global Moran’s I test for residual spatial autocorrelation shows that its p-value is less than 0.05, meaning we have sufficient evidence to reject the null hypothesis that the residuals are randomly distributed.

Since the observed Global Moran’s I = 0.32254, which is greater than 0, we can infer that the residuals are spatially clustered.
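
To show what the statistic and the permutation test compute, here is a minimal base R sketch on simulated points. Everything in it is an assumption for illustration: the coordinates and values are synthetic, moran_i() is a hypothetical helper, and the pseudo p-value is the simple one-sided version, which may differ from how sfdep reports it.

```r
set.seed(42)
n  <- 200
xy <- cbind(runif(n), runif(n))
z  <- xy[, 1] + xy[, 2] + rnorm(n, sd = 0.2)   # values with a spatial trend

# Row-standardised weights from the k = 6 nearest neighbours
k <- 6
d <- as.matrix(dist(xy))
diag(d) <- Inf                                  # a point is not its own neighbour
W <- matrix(0, n, n)
for (i in seq_len(n)) W[i, order(d[i, ])[1:k]] <- 1 / k

# Hypothetical helper: global Moran's I for values v and weight matrix W
moran_i <- function(v, W) {
  zc <- v - mean(v)
  (length(v) / sum(W)) * sum(W * outer(zc, zc)) / sum(zc^2)
}

obs  <- moran_i(z, W)
perm <- replicate(99, moran_i(sample(z), W))    # shuffle values across locations
p_value <- (sum(perm >= obs) + 1) / (99 + 1)    # one-sided pseudo p-value

c(moran_i = obs, p_value = p_value)
```

With 99 simulations, the smallest pseudo p-value this formula can return is 0.01, so a larger nsim is needed if finer resolution matters.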

7.5 Building Hedonic Pricing Models using GWmodel

7.5.1 Building Fixed Bandwidth GWR Model

7.5.1.1 Computing fixed bandwidth

In the code chunk below, the bw.gwr() function of the GWmodel package is used to determine the optimal fixed bandwidth to use in the model.

Notice that the argument adaptive is set to FALSE, which indicates that we want to compute a fixed bandwidth.

Two approaches can be used to determine the stopping rule:

  • CV cross-validation approach

  • AIC corrected (AICc) approach.

We define the stopping rule using the approach argument.

bw.fixed <- bw.gwr(formula = SELLING_PRICE ~ AREA_SQM + AGE + PROX_CBD + 
                     PROX_CHILDCARE + PROX_ELDERLYCARE  + PROX_URA_GROWTH_AREA + 
                     PROX_MRT   + PROX_PARK + PROX_PRIMARY_SCH + 
                     PROX_SHOPPING_MALL + PROX_BUS_STOP + NO_Of_UNITS + 
                     FAMILY_FRIENDLY + FREEHOLD, 
                   data=condo_resale.sf, 
                   approach="CV", 
                   kernel="gaussian", 
                   adaptive=FALSE, 
                   longlat=FALSE)
Fixed bandwidth: 17660.96 CV score: 8.259118e+14 
Fixed bandwidth: 10917.26 CV score: 7.970454e+14 
Fixed bandwidth: 6749.419 CV score: 7.273273e+14 
Fixed bandwidth: 4173.553 CV score: 6.300006e+14 
Fixed bandwidth: 2581.58 CV score: 5.404958e+14 
Fixed bandwidth: 1597.687 CV score: 4.857515e+14 
Fixed bandwidth: 989.6077 CV score: 4.722431e+14 
Fixed bandwidth: 613.7939 CV score: 1.378294e+16 
Fixed bandwidth: 1221.873 CV score: 4.778717e+14 
Fixed bandwidth: 846.0596 CV score: 4.791629e+14 
Fixed bandwidth: 1078.325 CV score: 4.751406e+14 
Fixed bandwidth: 934.7772 CV score: 4.72518e+14 
Fixed bandwidth: 1023.495 CV score: 4.730305e+14 
Fixed bandwidth: 968.6643 CV score: 4.721317e+14 
Fixed bandwidth: 955.7206 CV score: 4.722072e+14 
Fixed bandwidth: 976.6639 CV score: 4.721387e+14 
Fixed bandwidth: 963.7202 CV score: 4.721484e+14 
Fixed bandwidth: 971.7199 CV score: 4.721293e+14 
Fixed bandwidth: 973.6083 CV score: 4.721309e+14 
Fixed bandwidth: 970.5527 CV score: 4.721295e+14 
Fixed bandwidth: 972.4412 CV score: 4.721296e+14 
Fixed bandwidth: 971.2741 CV score: 4.721292e+14 
Fixed bandwidth: 970.9985 CV score: 4.721293e+14 
Fixed bandwidth: 971.4443 CV score: 4.721292e+14 
Fixed bandwidth: 971.5496 CV score: 4.721293e+14 
Fixed bandwidth: 971.3793 CV score: 4.721292e+14 
Fixed bandwidth: 971.3391 CV score: 4.721292e+14 
Fixed bandwidth: 971.3143 CV score: 4.721292e+14 
Fixed bandwidth: 971.3545 CV score: 4.721292e+14 
Fixed bandwidth: 971.3296 CV score: 4.721292e+14 
Fixed bandwidth: 971.345 CV score: 4.721292e+14 
Fixed bandwidth: 971.3355 CV score: 4.721292e+14 
Fixed bandwidth: 971.3413 CV score: 4.721292e+14 
Fixed bandwidth: 971.3377 CV score: 4.721292e+14 
Fixed bandwidth: 971.34 CV score: 4.721292e+14 
Fixed bandwidth: 971.3405 CV score: 4.721292e+14 
Fixed bandwidth: 971.3408 CV score: 4.721292e+14 
Fixed bandwidth: 971.3403 CV score: 4.721292e+14 
Fixed bandwidth: 971.3406 CV score: 4.721292e+14 
Fixed bandwidth: 971.3404 CV score: 4.721292e+14 
Fixed bandwidth: 971.3405 CV score: 4.721292e+14 
Fixed bandwidth: 971.3405 CV score: 4.721292e+14 

The result shows that the recommended bandwidth is 971.3405 metres.
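
The idea behind the CV score can be sketched in base R. For each candidate bandwidth, every observation is predicted from a locally weighted regression that leaves that observation out, and the squared prediction errors are summed. The sketch below runs on simulated data with a simple grid of bandwidths instead of the iterative search shown in the log; gauss_w() and cv_score() are hypothetical helpers, and the kernel form exp(-0.5 * (d / bw)^2) is assumed to correspond to the "gaussian" option.

```r
set.seed(1)
n      <- 150
coords <- cbind(runif(n, 0, 10), runif(n, 0, 10))
x      <- rnorm(n)
beta   <- 1 + coords[, 1] / 5             # slope drifts from west to east
y      <- 2 + beta * x + rnorm(n, sd = 0.3)
D      <- as.matrix(dist(coords))         # pairwise distances
X      <- cbind(1, x)                     # design matrix with intercept

# Gaussian kernel: weight decays smoothly with distance d for bandwidth bw
gauss_w <- function(d, bw) exp(-0.5 * (d / bw)^2)

# Leave-one-out cross-validation score for one fixed bandwidth
cv_score <- function(bw) {
  err <- numeric(n)
  for (i in seq_len(n)) {
    w <- gauss_w(D[i, ], bw)
    w[i] <- 0                             # leave observation i out
    fit <- lm.wfit(X, y, w)               # weighted least squares at location i
    err[i] <- y[i] - sum(X[i, ] * fit$coefficients)
  }
  sum(err^2)
}

bws     <- c(1, 2, 4, 8, 16)
scores  <- sapply(bws, cv_score)
best_bw <- bws[which.min(scores)]
data.frame(bandwidth = bws, cv = scores)
```

A very large bandwidth gives nearly equal weights everywhere and so approaches the global model, which is why its CV score is worse when the relationship genuinely varies over space.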

7.5.1.2 GWmodel method - fixed bandwidth

Now we can use the code chunk below to calibrate the GWR model using the fixed bandwidth and a Gaussian kernel.

gwr.fixed <- gwr.basic(formula = SELLING_PRICE ~ AREA_SQM + AGE + 
                         PROX_CBD + PROX_CHILDCARE + PROX_ELDERLYCARE + 
                         PROX_URA_GROWTH_AREA + PROX_MRT + PROX_PARK + 
                         PROX_PRIMARY_SCH + PROX_SHOPPING_MALL + PROX_BUS_STOP + 
                         NO_Of_UNITS + FAMILY_FRIENDLY + FREEHOLD, 
                       data = condo_resale.sf,
                       bw = bw.fixed, 
                       kernel = 'gaussian', 
                       longlat = FALSE)

The output is saved in a list of class ‘gwrm’. The code chunk below is used to display the model output.

gwr.fixed
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2024-10-17 13:24:34.435927 
   Call:
   gwr.basic(formula = SELLING_PRICE ~ AREA_SQM + AGE + PROX_CBD + 
    PROX_CHILDCARE + PROX_ELDERLYCARE + PROX_URA_GROWTH_AREA + 
    PROX_MRT + PROX_PARK + PROX_PRIMARY_SCH + PROX_SHOPPING_MALL + 
    PROX_BUS_STOP + NO_Of_UNITS + FAMILY_FRIENDLY + FREEHOLD, 
    data = condo_resale.sf, bw = bw.fixed, kernel = "gaussian", 
    longlat = FALSE)

   Dependent (y) variable:  SELLING_PRICE
   Independent variables:  AREA_SQM AGE PROX_CBD PROX_CHILDCARE PROX_ELDERLYCARE PROX_URA_GROWTH_AREA PROX_MRT PROX_PARK PROX_PRIMARY_SCH PROX_SHOPPING_MALL PROX_BUS_STOP NO_Of_UNITS FAMILY_FRIENDLY FREEHOLD
   Number of data points: 1436
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
     Min       1Q   Median       3Q      Max 
-3470778  -298119   -23481   248917 12234210 

   Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
   (Intercept)           527633.22  108183.22   4.877 1.20e-06 ***
   AREA_SQM               12777.52     367.48  34.771  < 2e-16 ***
   AGE                   -24687.74    2754.84  -8.962  < 2e-16 ***
   PROX_CBD              -77131.32    5763.12 -13.384  < 2e-16 ***
   PROX_CHILDCARE       -318472.75  107959.51  -2.950 0.003231 ** 
   PROX_ELDERLYCARE      185575.62   39901.86   4.651 3.61e-06 ***
   PROX_URA_GROWTH_AREA   39163.25   11754.83   3.332 0.000885 ***
   PROX_MRT             -294745.11   56916.37  -5.179 2.56e-07 ***
   PROX_PARK             570504.81   65507.03   8.709  < 2e-16 ***
   PROX_PRIMARY_SCH      159856.14   60234.60   2.654 0.008046 ** 
   PROX_SHOPPING_MALL   -220947.25   36561.83  -6.043 1.93e-09 ***
   PROX_BUS_STOP         682482.22  134513.24   5.074 4.42e-07 ***
   NO_Of_UNITS             -245.48      87.95  -2.791 0.005321 ** 
   FAMILY_FRIENDLY       146307.58   46893.02   3.120 0.001845 ** 
   FREEHOLD              350599.81   48506.48   7.228 7.98e-13 ***

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
   Residual standard error: 756000 on 1421 degrees of freedom
   Multiple R-squared: 0.6507
   Adjusted R-squared: 0.6472 
   F-statistic: 189.1 on 14 and 1421 DF,  p-value: < 2.2e-16 
   ***Extra Diagnostic information
   Residual sum of squares: 8.120609e+14
   Sigma(hat): 752522.9
   AIC:  42966.76
   AICc:  42967.14
   BIC:  41731.39
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Fixed bandwidth: 971.3405 
   Regression points: the same locations as observations are used.
   Distance metric: Euclidean distance metric is used.

   ****************Summary of GWR coefficient estimates:******************
                               Min.     1st Qu.      Median     3rd Qu.
   Intercept            -3.5988e+07 -5.1998e+05  7.6780e+05  1.7412e+06
   AREA_SQM              1.0003e+03  5.2758e+03  7.4740e+03  1.2301e+04
   AGE                  -1.3475e+05 -2.0813e+04 -8.6260e+03 -3.7784e+03
   PROX_CBD             -7.7047e+07 -2.3608e+05 -8.3600e+04  3.4646e+04
   PROX_CHILDCARE       -6.0097e+06 -3.3667e+05 -9.7425e+04  2.9007e+05
   PROX_ELDERLYCARE     -3.5000e+06 -1.5970e+05  3.1971e+04  1.9577e+05
   PROX_URA_GROWTH_AREA -3.0170e+06 -8.2013e+04  7.0749e+04  2.2612e+05
   PROX_MRT             -3.5282e+06 -6.5836e+05 -1.8833e+05  3.6922e+04
   PROX_PARK            -1.2062e+06 -2.1732e+05  3.5383e+04  4.1335e+05
   PROX_PRIMARY_SCH     -2.2695e+07 -1.7066e+05  4.8472e+04  5.1555e+05
   PROX_SHOPPING_MALL   -7.2585e+06 -1.6684e+05 -1.0517e+04  1.5923e+05
   PROX_BUS_STOP        -1.4676e+06 -4.5207e+04  3.7601e+05  1.1664e+06
   NO_Of_UNITS          -1.3170e+03 -2.4822e+02 -3.0846e+01  2.5496e+02
   FAMILY_FRIENDLY      -2.2749e+06 -1.1140e+05  7.6214e+03  1.6107e+05
   FREEHOLD             -9.2067e+06  3.8073e+04  1.5169e+05  3.7528e+05
                             Max.
   Intercept            112793548
   AREA_SQM                 21575
   AGE                     434201
   PROX_CBD               2704596
   PROX_CHILDCARE         1654087
   PROX_ELDERLYCARE      38867814
   PROX_URA_GROWTH_AREA  78515730
   PROX_MRT               3124316
   PROX_PARK             18122425
   PROX_PRIMARY_SCH       4637503
   PROX_SHOPPING_MALL     1529952
   PROX_BUS_STOP         11342182
   NO_Of_UNITS              12907
   FAMILY_FRIENDLY        1720744
   FREEHOLD               6073636
   ************************Diagnostic information*************************
   Number of data points: 1436 
   Effective number of parameters (2trace(S) - trace(S'S)): 438.3804 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 997.6196 
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 42263.61 
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 41632.36 
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 42515.71 
   Residual sum of squares: 2.53407e+14 
   R-square value:  0.8909912 
   Adjusted R-square value:  0.8430417 

   ***********************************************************************
   Program stops at: 2024-10-17 13:24:36.082047 

Notice the improvement in the Adjusted R-squared value: it rises from 0.6472 for the global regression to 0.8430 for the fixed-bandwidth GWR model.

7.5.2 Building Adaptive Bandwidth GWR Model

7.5.2.1 Computing the adaptive bandwidth

Similar to the earlier section, we first use bw.gwr() to determine the recommended number of data points (nearest neighbours) to use.

The code chunk looks very similar to the one used to compute the fixed bandwidth, except that the adaptive argument is now set to TRUE.

bw.adaptive <- bw.gwr(formula = SELLING_PRICE ~ AREA_SQM + AGE  + 
                        PROX_CBD + PROX_CHILDCARE + PROX_ELDERLYCARE    + 
                        PROX_URA_GROWTH_AREA + PROX_MRT + PROX_PARK + 
                        PROX_PRIMARY_SCH + PROX_SHOPPING_MALL   + PROX_BUS_STOP + 
                        NO_Of_UNITS + FAMILY_FRIENDLY + FREEHOLD, 
                      data=condo_resale.sf, 
                      approach="CV", 
                      kernel="gaussian", 
                      adaptive=TRUE, 
                      longlat=FALSE)
Adaptive bandwidth: 895 CV score: 7.952401e+14 
Adaptive bandwidth: 561 CV score: 7.667364e+14 
Adaptive bandwidth: 354 CV score: 6.953454e+14 
Adaptive bandwidth: 226 CV score: 6.15223e+14 
Adaptive bandwidth: 147 CV score: 5.674373e+14 
Adaptive bandwidth: 98 CV score: 5.426745e+14 
Adaptive bandwidth: 68 CV score: 5.168117e+14 
Adaptive bandwidth: 49 CV score: 4.859631e+14 
Adaptive bandwidth: 37 CV score: 4.646518e+14 
Adaptive bandwidth: 30 CV score: 4.422088e+14 
Adaptive bandwidth: 25 CV score: 4.430816e+14 
Adaptive bandwidth: 32 CV score: 4.505602e+14 
Adaptive bandwidth: 27 CV score: 4.462172e+14 
Adaptive bandwidth: 30 CV score: 4.422088e+14 
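The search trace above can be read as a table of candidate bandwidths and their cross-validation scores, where the recommended bandwidth is the candidate with the lowest score. As a minimal sketch, the values below are copied by hand from the printed trace; the object names cv_trace and best are illustrative only and are not produced by bw.gwr().

```r
# Candidate bandwidths and CV scores, copied from the bw.gwr() trace above
cv_trace <- data.frame(
  bandwidth = c(895, 561, 354, 226, 147, 98, 68, 49, 37, 30, 25, 32, 27, 30),
  cv_score  = c(7.952401e+14, 7.667364e+14, 6.953454e+14, 6.15223e+14,
                5.674373e+14, 5.426745e+14, 5.168117e+14, 4.859631e+14,
                4.646518e+14, 4.422088e+14, 4.430816e+14, 4.505602e+14,
                4.462172e+14, 4.422088e+14)
)

# The candidate with the smallest CV score is the recommended bandwidth
best <- cv_trace[which.min(cv_trace$cv_score), ]
best$bandwidth
```

With these figures the lowest score belongs to a bandwidth of 30 nearest neighbours, the same value the model output reports as the adaptive bandwidth.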

7.5.2.2 Constructing the adaptive bandwidth GWR model

gwr.adaptive <- gwr.basic(formula = SELLING_PRICE ~ AREA_SQM + AGE + 
                            PROX_CBD + PROX_CHILDCARE + PROX_ELDERLYCARE + 
                            PROX_URA_GROWTH_AREA + PROX_MRT + PROX_PARK + 
                            PROX_PRIMARY_SCH + PROX_SHOPPING_MALL + PROX_BUS_STOP + 
                            NO_Of_UNITS + FAMILY_FRIENDLY + FREEHOLD, 
                          data=condo_resale.sf, bw=bw.adaptive, 
                          kernel = 'gaussian', 
                          adaptive=TRUE, 
                          longlat = FALSE)
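To make the role of an adaptive bandwidth more concrete, the sketch below computes Gaussian weights whose bandwidth is the distance to the k-th nearest observation, so the kernel widens where points are sparse and narrows where they are dense. This is an illustration in base R only: adaptive_gaussian_weights() is a hypothetical helper, not a GWmodel function, the coordinates are made up, and counting the regression point itself among the k neighbours is an assumption of this sketch.

```r
# Hypothetical helper (not part of GWmodel): Gaussian kernel weights with an
# adaptive bandwidth set by the distance to the k-th nearest observation.
adaptive_gaussian_weights <- function(coords, i, k) {
  # Euclidean distances from regression point i to every observation
  d <- sqrt(rowSums(sweep(coords, 2, coords[i, ])^2))
  # Local bandwidth: k-th smallest distance (assumption: the point itself,
  # at distance 0, counts as the first neighbour)
  h <- sort(d)[k]
  list(h = h, w = exp(-0.5 * (d / h)^2))
}

# Made-up coordinates: three points close together, three far apart
coords <- rbind(c(0, 0), c(1, 0), c(0, 1), c(10, 10), c(30, 10), c(10, 40))

dense  <- adaptive_gaussian_weights(coords, i = 1, k = 3)
sparse <- adaptive_gaussian_weights(coords, i = 4, k = 3)

dense$h   # bandwidth around a point in the dense cluster
sparse$h  # bandwidth around an isolated point is larger
```

A fixed bandwidth would apply the same distance everywhere, whereas here each regression point borrows information from the same number of neighbours regardless of local density.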

The code chunk below can be used to display the model output.

gwr.adaptive
   ***********************************************************************
   *                       Package   GWmodel                             *
   ***********************************************************************
   Program starts at: 2024-10-17 13:24:46.474889 
   Call:
   gwr.basic(formula = SELLING_PRICE ~ AREA_SQM + AGE + PROX_CBD + 
    PROX_CHILDCARE + PROX_ELDERLYCARE + PROX_URA_GROWTH_AREA + 
    PROX_MRT + PROX_PARK + PROX_PRIMARY_SCH + PROX_SHOPPING_MALL + 
    PROX_BUS_STOP + NO_Of_UNITS + FAMILY_FRIENDLY + FREEHOLD, 
    data = condo_resale.sf, bw = bw.adaptive, kernel = "gaussian", 
    adaptive = TRUE, longlat = FALSE)

   Dependent (y) variable:  SELLING_PRICE
   Independent variables:  AREA_SQM AGE PROX_CBD PROX_CHILDCARE PROX_ELDERLYCARE PROX_URA_GROWTH_AREA PROX_MRT PROX_PARK PROX_PRIMARY_SCH PROX_SHOPPING_MALL PROX_BUS_STOP NO_Of_UNITS FAMILY_FRIENDLY FREEHOLD
   Number of data points: 1436
   ***********************************************************************
   *                    Results of Global Regression                     *
   ***********************************************************************

   Call:
    lm(formula = formula, data = data)

   Residuals:
     Min       1Q   Median       3Q      Max 
-3470778  -298119   -23481   248917 12234210 

   Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
   (Intercept)           527633.22  108183.22   4.877 1.20e-06 ***
   AREA_SQM               12777.52     367.48  34.771  < 2e-16 ***
   AGE                   -24687.74    2754.84  -8.962  < 2e-16 ***
   PROX_CBD              -77131.32    5763.12 -13.384  < 2e-16 ***
   PROX_CHILDCARE       -318472.75  107959.51  -2.950 0.003231 ** 
   PROX_ELDERLYCARE      185575.62   39901.86   4.651 3.61e-06 ***
   PROX_URA_GROWTH_AREA   39163.25   11754.83   3.332 0.000885 ***
   PROX_MRT             -294745.11   56916.37  -5.179 2.56e-07 ***
   PROX_PARK             570504.81   65507.03   8.709  < 2e-16 ***
   PROX_PRIMARY_SCH      159856.14   60234.60   2.654 0.008046 ** 
   PROX_SHOPPING_MALL   -220947.25   36561.83  -6.043 1.93e-09 ***
   PROX_BUS_STOP         682482.22  134513.24   5.074 4.42e-07 ***
   NO_Of_UNITS             -245.48      87.95  -2.791 0.005321 ** 
   FAMILY_FRIENDLY       146307.58   46893.02   3.120 0.001845 ** 
   FREEHOLD              350599.81   48506.48   7.228 7.98e-13 ***

   ---Significance stars
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
   Residual standard error: 756000 on 1421 degrees of freedom
   Multiple R-squared: 0.6507
   Adjusted R-squared: 0.6472 
   F-statistic: 189.1 on 14 and 1421 DF,  p-value: < 2.2e-16 
   ***Extra Diagnostic information
   Residual sum of squares: 8.120609e+14
   Sigma(hat): 752522.9
   AIC:  42966.76
   AICc:  42967.14
   BIC:  41731.39
   ***********************************************************************
   *          Results of Geographically Weighted Regression              *
   ***********************************************************************

   *********************Model calibration information*********************
   Kernel function: gaussian 
   Adaptive bandwidth: 30 (number of nearest neighbours)
   Regression points: the same locations as observations are used.
   Distance metric: Euclidean distance metric is used.

   ****************Summary of GWR coefficient estimates:******************
                               Min.     1st Qu.      Median     3rd Qu.
   Intercept            -1.3487e+08 -2.4669e+05  7.7928e+05  1.6194e+06
   AREA_SQM              3.3188e+03  5.6285e+03  7.7825e+03  1.2738e+04
   AGE                  -9.6746e+04 -2.9288e+04 -1.4043e+04 -5.6119e+03
   PROX_CBD             -2.5330e+06 -1.6256e+05 -7.7242e+04  2.6624e+03
   PROX_CHILDCARE       -1.2790e+06 -2.0175e+05  8.7158e+03  3.7778e+05
   PROX_ELDERLYCARE     -1.6212e+06 -9.2050e+04  6.1029e+04  2.8184e+05
   PROX_URA_GROWTH_AREA -7.2686e+06 -3.0350e+04  4.5869e+04  2.4613e+05
   PROX_MRT             -4.3781e+07 -6.7282e+05 -2.2115e+05 -7.4593e+04
   PROX_PARK            -2.9020e+06 -1.6782e+05  1.1601e+05  4.6572e+05
   PROX_PRIMARY_SCH     -8.6418e+05 -1.6627e+05 -7.7853e+03  4.3222e+05
   PROX_SHOPPING_MALL   -1.8272e+06 -1.3175e+05 -1.4049e+04  1.3799e+05
   PROX_BUS_STOP        -2.0579e+06 -7.1461e+04  4.1104e+05  1.2071e+06
   NO_Of_UNITS          -2.1993e+03 -2.3685e+02 -3.4699e+01  1.1657e+02
   FAMILY_FRIENDLY      -5.9879e+05 -5.0927e+04  2.6173e+04  2.2481e+05
   FREEHOLD             -1.6340e+05  4.0765e+04  1.9023e+05  3.7960e+05
                            Max.
   Intercept            18758355
   AREA_SQM                23064
   AGE                     13303
   PROX_CBD             11346650
   PROX_CHILDCARE        2892127
   PROX_ELDERLYCARE      2465671
   PROX_URA_GROWTH_AREA  7384059
   PROX_MRT              1186242
   PROX_PARK             2588497
   PROX_PRIMARY_SCH      3381462
   PROX_SHOPPING_MALL   38038564
   PROX_BUS_STOP        12081592
   NO_Of_UNITS              1010
   FAMILY_FRIENDLY       2072414
   FREEHOLD              1813995
   ************************Diagnostic information*************************
   Number of data points: 1436 
   Effective number of parameters (2trace(S) - trace(S'S)): 350.3088 
   Effective degrees of freedom (n-2trace(S) + trace(S'S)): 1085.691 
   AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 41982.22 
   AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 41546.74 
   BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 41914.08 
   Residual sum of squares: 2.528227e+14 
   R-square value:  0.8912425 
   Adjusted R-square value:  0.8561185 

   ***********************************************************************
   Program stops at: 2024-10-17 13:24:47.943393 
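The three sets of diagnostics printed so far can be placed side by side. The sketch below copies the Adjusted R-squared and AICc figures by hand from the outputs above into a small data frame (model_fit is an illustrative name) and computes each model's AICc gap to the best model, on the usual reading that a lower AICc indicates a better trade-off between fit and complexity.

```r
# Figures copied from the printed model outputs above
model_fit <- data.frame(
  model  = c("Global OLS", "GWR, fixed bandwidth", "GWR, adaptive bandwidth"),
  adj_r2 = c(0.6472, 0.8430417, 0.8561185),
  aicc   = c(42967.14, 42263.61, 41982.22)
)

# Difference in AICc relative to the best (lowest) model
model_fit$delta_aicc <- model_fit$aicc - min(model_fit$aicc)

# Sort from best to worst by AICc
model_fit[order(model_fit$aicc), ]
```

On these figures the adaptive-bandwidth model has both the highest Adjusted R-squared and the lowest AICc of the three.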

7.5.3 Visualising GWR Output

The output feature class table reports several key metrics: observed and predicted values, residuals, the condition number (cond), Local R², and the coefficients of the explanatory variables with their standard errors:

  • Condition Number: This diagnostic assesses local collinearity. When strong local collinearity is present, model results become unstable. A condition number greater than 30 suggests that the results may be unreliable due to multicollinearity.

  • Local R²: Values range from 0.0 to 1.0 and indicate the goodness-of-fit of the local regression model. Low Local R² values signal poor model performance in those regions. Mapping these values can help identify areas where the Geographically Weighted Regression (GWR) model is performing well and where it is underperforming, potentially highlighting missing or unaccounted-for variables.

  • Predicted Values: These are the fitted y values estimated by the GWR model.

  • Residuals: Residuals are calculated by subtracting the fitted y values from the observed y values. Standardized residuals have a mean of zero and a standard deviation of one. A gradient map (cold-to-hot) of standardized residuals can be created to visualize areas of model under- or overestimation.

  • Coefficient Standard Errors: These values reflect the reliability of each coefficient estimate. Smaller standard errors relative to the actual coefficients indicate higher confidence in the estimates. Large standard errors, however, may suggest issues with local collinearity.

All of these metrics are stored in a SpatialPointsDataFrame or SpatialPolygonsDataFrame object called SDF within the output list. Its “data” slot holds the fit points, GWR coefficient estimates, observed and predicted y values, coefficient standard errors, and t-values.
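The relationship between a coefficient, its standard error and its t-value can be checked by hand. The sketch below uses the first four local intercept estimates that appear in the glimpse() output later in this section (columns Intercept, Intercept_SE and Intercept_TV); the data frame local_est and the 1.96 cut-off are assumptions made for illustration. The cut-off is only the familiar approximate two-sided 5% threshold and ignores multiple testing across locations.

```r
# First four local intercept estimates, copied from the glimpse() output
local_est <- data.frame(
  Intercept    = c(2050011.67, 1633128.24, 3433608.17, 234358.91),
  Intercept_SE = c(516105.5, 488083.5, 963711.4, 444185.5),
  Intercept_TV = c(3.9720784, 3.3460017, 3.5629010, 0.5276150)
)

# A t-value is the estimate divided by its standard error
local_est$t_check <- local_est$Intercept / local_est$Intercept_SE

# Rough flag: is the local estimate distinguishable from zero?
local_est$signif_5pct <- abs(local_est$Intercept_TV) > 1.96

local_est
```

The recomputed ratios agree with the stored t-values up to the rounding of the printed figures, and the fourth location shows how a large standard error relative to the coefficient leads to a weak local estimate.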

7.5.4 Converting SDF into sf data.frame

SDF holds the local estimates, including the intercepts.

gwr.adaptive.output <- as.data.frame(gwr.adaptive$SDF) %>%
  select(-c(2:15)) # drop columns 2 to 15 (the local coefficients, named after the explanatory variables) to keep the table tidy

gwr_sf_adaptive <- cbind(condo_resale.sf, gwr.adaptive.output)

Next, glimpse() and summary() are used to display the content and summary statistics of the gwr_sf_adaptive sf data frame.

glimpse(gwr_sf_adaptive)
Rows: 1,436
Columns: 64
$ nb                      <nb> <66, 77, 123, 238, 239, 343>, <21, 162, 163, 19…
$ wt                      <list> <0.1666667, 0.1666667, 0.1666667, 0.1666667, …
$ POSTCODE                <dbl> 118635, 288420, 267833, 258380, 467169, 466472…
$ SELLING_PRICE           <dbl> 3000000, 3880000, 3325000, 4250000, 1400000, 1…
$ AREA_SQM                <dbl> 309, 290, 248, 127, 145, 139, 218, 141, 165, 1…
$ AGE                     <dbl> 30, 32, 33, 7, 28, 22, 24, 24, 27, 31, 17, 22,…
$ PROX_CBD                <dbl> 7.941259, 6.609797, 6.898000, 4.038861, 11.783…
$ PROX_CHILDCARE          <dbl> 0.16597932, 0.28027246, 0.42922669, 0.39473543…
$ PROX_ELDERLYCARE        <dbl> 2.5198118, 1.9333338, 0.5021395, 1.9910316, 1.…
$ PROX_URA_GROWTH_AREA    <dbl> 6.618741, 7.505109, 6.463887, 4.906512, 6.4106…
$ PROX_HAWKER_MARKET      <dbl> 1.76542207, 0.54507614, 0.37789301, 1.68259969…
$ PROX_KINDERGARTEN       <dbl> 0.05835552, 0.61592412, 0.14120309, 0.38200076…
$ PROX_MRT                <dbl> 0.5607188, 0.6584461, 0.3053433, 0.6910183, 0.…
$ PROX_PARK               <dbl> 1.1710446, 0.1992269, 0.2779886, 0.9832843, 0.…
$ PROX_PRIMARY_SCH        <dbl> 1.6340256, 0.9747834, 1.4715016, 1.4546324, 0.…
$ PROX_TOP_PRIMARY_SCH    <dbl> 3.3273195, 0.9747834, 1.4715016, 2.3006394, 0.…
$ PROX_SHOPPING_MALL      <dbl> 2.2102717, 2.9374279, 1.2256850, 0.3525671, 1.…
$ PROX_SUPERMARKET        <dbl> 0.9103958, 0.5900617, 0.4135583, 0.4162219, 0.…
$ PROX_BUS_STOP           <dbl> 0.10336166, 0.28673408, 0.28504777, 0.29872340…
$ NO_Of_UNITS             <dbl> 18, 20, 27, 30, 30, 31, 32, 32, 32, 32, 34, 34…
$ FAMILY_FRIENDLY         <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0…
$ FREEHOLD                <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1…
$ LEASEHOLD_99YR          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ LOG_SELLING_PRICE       <dbl> 14.91412, 15.17135, 15.01698, 15.26243, 14.151…
$ MLR_RES                 <dbl> -1489099.55, 415494.57, 194129.69, 1088992.71,…
$ Intercept               <dbl> 2050011.67, 1633128.24, 3433608.17, 234358.91,…
$ y                       <dbl> 3000000, 3880000, 3325000, 4250000, 1400000, 1…
$ yhat                    <dbl> 2886531.8, 3466801.5, 3616527.2, 5435481.6, 13…
$ residual                <dbl> 113468.16, 413198.52, -291527.20, -1185481.63,…
$ CV_Score                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ Stud_residual           <dbl> 0.38207013, 1.01433140, -0.83780678, -2.846146…
$ Intercept_SE            <dbl> 516105.5, 488083.5, 963711.4, 444185.5, 211962…
$ AREA_SQM_SE             <dbl> 823.2860, 825.2380, 988.2240, 617.4007, 1376.2…
$ AGE_SE                  <dbl> 5889.782, 6226.916, 6510.236, 6010.511, 8180.3…
$ PROX_CBD_SE             <dbl> 37411.22, 23615.06, 56103.77, 469337.41, 41064…
$ PROX_CHILDCARE_SE       <dbl> 319111.1, 299705.3, 349128.5, 304965.2, 698720…
$ PROX_ELDERLYCARE_SE     <dbl> 120633.34, 84546.69, 129687.07, 127150.69, 327…
$ PROX_URA_GROWTH_AREA_SE <dbl> 56207.39, 76956.50, 95774.60, 470762.12, 47433…
$ PROX_MRT_SE             <dbl> 185181.3, 281133.9, 275483.7, 279877.1, 363830…
$ PROX_PARK_SE            <dbl> 205499.6, 229358.7, 314124.3, 227249.4, 364580…
$ PROX_PRIMARY_SCH_SE     <dbl> 152400.7, 165150.7, 196662.6, 240878.9, 249087…
$ PROX_SHOPPING_MALL_SE   <dbl> 109268.8, 98906.8, 119913.3, 177104.1, 301032.…
$ PROX_BUS_STOP_SE        <dbl> 600668.6, 410222.1, 464156.7, 562810.8, 740922…
$ NO_Of_UNITS_SE          <dbl> 218.1258, 208.9410, 210.9828, 361.7767, 299.50…
$ FAMILY_FRIENDLY_SE      <dbl> 131474.73, 114989.07, 146607.22, 108726.62, 16…
$ FREEHOLD_SE             <dbl> 115954.0, 130110.0, 141031.5, 138239.1, 210641…
$ Intercept_TV            <dbl> 3.9720784, 3.3460017, 3.5629010, 0.5276150, 1.…
$ AREA_SQM_TV             <dbl> 11.614302, 20.087361, 13.247868, 33.577223, 4.…
$ AGE_TV                  <dbl> -1.6154474, -9.3441881, -4.1023685, -15.524301…
$ PROX_CBD_TV             <dbl> -3.22582173, -6.32792021, -4.62353528, 5.17080…
$ PROX_CHILDCARE_TV       <dbl> 1.000488185, 1.471786337, -0.344047555, 1.5766…
$ PROX_ELDERLYCARE_TV     <dbl> -3.26126929, 3.84626245, 4.13191383, 2.4756745…
$ PROX_URA_GROWTH_AREA_TV <dbl> -2.846248368, -1.848971738, -2.648105057, -5.6…
$ PROX_MRT_TV             <dbl> -1.61864578, -8.92998600, -3.40075727, -7.2870…
$ PROX_PARK_TV            <dbl> -0.83749312, 2.28192684, 0.66565951, -3.340617…
$ PROX_PRIMARY_SCH_TV     <dbl> 1.59230221, 6.70194543, 2.90580089, 12.9836104…
$ PROX_SHOPPING_MALL_TV   <dbl> 2.753588422, -0.886626400, -1.056869486, -0.16…
$ PROX_BUS_STOP_TV        <dbl> 2.0154464, 4.4941192, 3.0419145, 12.8383775, 0…
$ NO_Of_UNITS_TV          <dbl> 0.480589953, -1.380026395, -0.045279967, -0.44…
$ FAMILY_FRIENDLY_TV      <dbl> -0.06902748, 2.69655779, 0.04058290, 14.312764…
$ FREEHOLD_TV             <dbl> 2.6213469, 3.0452799, 1.1970499, 8.7711485, 1.…
$ Local_R2                <dbl> 0.8846744, 0.8899773, 0.8947007, 0.9073605, 0.…
$ geometry                <POINT [m]> POINT (22085.12 29951.54), POINT (25656.…
$ geometry.1              <POINT [m]> POINT (22085.12 29951.54), POINT (25656.…
summary(gwr_sf_adaptive$yhat)
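The y, yhat and residual columns shown by glimpse() are linked by residual = y - yhat. As a quick sanity check, the sketch below copies the first four values of each column by hand and confirms the relationship; the direction labels are illustrative wording, not output of GWmodel.

```r
# First four observed values, fitted values and residuals from glimpse()
y        <- c(3000000, 3880000, 3325000, 4250000)
yhat     <- c(2886531.8, 3466801.5, 3616527.2, 5435481.6)
residual <- c(113468.16, 413198.52, -291527.20, -1185481.63)

# Residual = observed - fitted (up to the rounding of the printed values)
residual_ok <- isTRUE(all.equal(y - yhat, residual, tolerance = 1e-6))

# Positive residual: the model predicts less than the observed price
direction <- ifelse(residual > 0, "under-predicted", "over-predicted")

residual_ok
direction
```

Mapping the sign and size of these residuals is what reveals where the model under- or overestimates selling prices.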

7.5.5 Visualizing local R2

The code chunk below is used to create an interactive point symbol map of the Local R² values.

tmap_mode("view")
tm_shape(mpsz_svy21)+
  tm_polygons(alpha = 0.1) +
tm_shape(gwr_sf_adaptive) +  
  tm_dots(col = "Local_R2",
          border.col = "gray60",
          border.lwd = 1) +
  tm_view(set.zoom.limits = c(11,14))
tmap_mode('plot')